AD A045544


A METHODOLOGY  FOR  DATA  BASE  DESIGN  IN  A PAGING  ENVIRONMENT 
The  University  of  Michigan 


ROME AIR DEVELOPMENT CENTER

Air Force Systems Command

Griffiss Air Force Base, New York 13441


This report has been reviewed by the RADC Information Office (OI)
and is releasable to the National Technical Information Service (NTIS).
At NTIS it will be releasable to the general public, including foreign
nations.


This  report  has  been  reviewed  and  is  approved  for  publication. 


APPROVED: 


DONALD  M.  ELEFANTE 
Project  Engineer 


APPROVED: 



ROBERT  D.  KRUTZ,  Colonel,  USAF 
Chief,  Information  Sciences  Division 


Acting  Chief,  Plans  Office 


If your address has changed or if you wish to be removed from the RADC
mailing list, or if the addressee is no longer employed by your organization,
please notify RADC (DAP) Griffiss AFB NY 13441. This will assist us in
maintaining a current mailing list.

Do  not  return  this  copy.  Retain  or  destroy. 


UNCLASSIFIED 


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE
READ INSTRUCTIONS BEFORE COMPLETING FORM

2. GOVT ACCESSION NO.

3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle)
   A METHODOLOGY FOR DATA BASE DESIGN IN A PAGING ENVIRONMENT

5. TYPE OF REPORT & PERIOD COVERED
   Interim Report
   1 Oct 7[illegible] - 1 May 77

6. PERFORMING ORG. REPORT NUMBER
   N/A

7. AUTHOR(s)
   Elias Berelian
   Keki B. Irani

8. CONTRACT OR GRANT NUMBER(s)
   F30602-76-C-0029

9. PERFORMING ORGANIZATION NAME AND ADDRESS
   University of Michigan/Department of Computer &
   Electrical Engineering Systems Engineering Lab
   Ann Arbor MI 48104

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
    [illegible]

11. CONTROLLING OFFICE NAME AND ADDRESS
    Rome Air Development Center (ISFO)
    Griffiss AFB NY 13441

13. NUMBER OF PAGES
    285

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)
    UNCLASSIFIED

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE
     N/A

16. DISTRIBUTION STATEMENT (of this Report)
    Approved for public release; distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)
    Same

18. SUPPLEMENTARY NOTES
    RADC Project Engineer:
    Donald Elefante (ISFO)

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
    Computer Applications
    Computer Programming
    Relational Computability
    Relational Schemata
    Access-Path
    Graph Grammar
    Data Base Design
    Paging

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

Report is concerned with the development of a methodology for partially
automating the design of the logical structure of a paged data base. The
approach is to map a high-level description of user requirements in terms of
individual data items and binary relations into a set of record structures and
record relationships. Subject to storage and security constraints, the goal
is to produce a minimum number of page faults for a pre-specified set of user
activities.


DD FORM 1473    EDITION OF 1 NOV 65 IS OBSOLETE

UNCLASSIFIED

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


ABSTRACT 


A METHODOLOGY FOR DATA BASE DESIGN
IN A PAGING ENVIRONMENT

by 

Elias  Berelian 
Chairman:  Keki  B.  Irani 

This research is concerned with a methodology for partially
automating the design of the logical structure of a paged data
base. The methodology may be used by a data base designer as a
design tool, to investigate the space/time trade-offs of a large
number of logical structures that describe a given application.
The approach is to map a high level description of user
requirements in terms of individual data items and binary
relations into a set of record structures and record
relationships. Intra- and inter-record structures generated by
this design method conform to the CODASYL Data Base Task Group
specifications [1]. The data base is assumed to be accessed in a
paging environment. The mapping is such that the resulting data
base structure is optimal, over a certain class of DBTG
structures, in the sense that it produces a minimum expected
number of page faults for a pre-specified set of user activities,
subject to storage and security constraints.

A set of implementation alternatives for data base relations has
been defined and analyzed. The collection of exactly one
implementation for each relation in a particular application
determines the overall structure of the data base for the
application. A two-component cost function is developed for
relation implementations. The first component is the storage cost
and consists of the storage space required to store data item
values, counter and pointer variables and security locks. The
second cost component is the time cost of the relation
implementation measured in the number of page faults expected to
occur when a certain set of data manipulation operations is
executed. Time cost functions are defined in terms of a certain
number of page fault categories. Probability of a page fault in
each category is computed from the parameters describing the
application.

The task of the optimization algorithm is to select exactly one
implementation for each relation such that the total time cost is
minimized while total storage cost is bounded by some given value
and certain constraint conditions are satisfied. The algorithm
treats this problem as a discrete, multi-stage decision process
in which stages correspond to relations. For each relation, all
acceptable implementations are first determined and then a
sequence of "undominated choices" for each stage is generated.
The next phase of the decision process develops (k+1)-stage
"undominated solutions" from k-stage undominated solutions and
undominated choices of stage k+1. The sequence of undominated
solutions of the final stage gives a spectrum of optimal
solutions for various storage space limits. Properties of
undominated choices and undominated solutions have been formally
investigated and exploited to improve the efficiency of the
optimization algorithm.

An  example  application  has  been  constructed  and  the 
results  of  applying  the  data  base  design  methodology  to  the 
example  application  have  been  studied. 

[1]  CODASYL,  "Data  Base  Task  Group,"  April  1971  Report, 
Conference  on  Data  Systems  Languages. 




ACKNOWLEDGEMENTS 


I would like to thank Professor Keki B. Irani, chairman of my
doctoral committee, for the invaluable guidance and suggestions
he provided throughout the research leading to this dissertation.
His patience, understanding and friendship are equally
appreciated.

I thank members of my committee, Professors Chuang, Rosenthal,
Teorey and Volz for the time, interest and talent they devoted to
the progress of this research. Professor Rosenthal's suggestions
were especially helpful for Chapter 4.

I am  also  grateful  to:  Jamshed  Mulla  for  his  help  in 
various  stages  of  programming,  particularly  for  plotting 
programs,  Annette  Sizemore  for  typing  Chapter  4,  and 
Karen  Chapin  for  her  professional  efficiency  in  typing 
the  rest  of  the  dissertation. 

This research was sponsored primarily by the Rome Air Development
Center at Griffiss Air Force Base under Contract Number
F30602-76-C-0029.

Finally, I thank my wife, Esther, and my family. Their patience
and understanding were invaluable. The continuing faith,
encouragement and support of my parents, Mr. and Mrs. Faraj
Berelian, throughout my graduate studies, are deeply appreciated.

Elias  Berelian 
April  1977 



TABLE OF CONTENTS

Chapter  Page

I  INTRODUCTION  1
   1.1  Motivation and Objectives  1
   1.2  DBTG Proposal  8

II  USER REQUIREMENTS MODEL  16
   2.1  Data Item Types and Data Item Values  16
   2.2  Data Base Relations  19
   2.3  User Activity  29
   2.4  Storage Space Parameters  39

III  IMPLEMENTATION ALTERNATIVES AND COSTS  43
   3.1  Implementation Alternatives  45
   3.2  Constraint Conditions  92
   3.3  Storage Costs  113
   3.4  Time Costs  123

IV  DATA BASE DESIGN  171
   4.1  Problem Set Up  171
   4.2  Undominated Choices  187
   4.3  Undominated Solutions  198
   4.4  Data Base Design  216

V  RESULTS  223
   5.1  Sensitivity of Time Cost Functions  223
   5.2  Discussion of Resulting Data Base Designs  234
   5.3  Storage/Time Trade-Off  252
   5.4  Effects of Some Parameter Variations  263

VI  CONCLUSIONS AND FURTHER RESEARCH  277
   6.1  Conclusions  277
   6.2  Further Research  279

REFERENCES  284


LIST OF ILLUSTRATIONS

Figure  Page

2.1  An Example Set SI of 24 Data Items I_i  18
2.2  An Example Relation R(A,B)  22
2.3  An Example Set SR of 26 Data Base Relations R_j  24
2.4  Directed Graph Representation of SI and SR  27
2.5  An Example Set SU of 4 Run Units  36
3.1  A Listing of the 24 Implementation Alternatives  47
3.2  Fixed Duplication of B under A  50
3.3  Fixed Duplication of A under B  50
3.4  Variable Duplication of B under A  52
3.5  Variable Duplication of A under B  52
3.6  Fixed Aggregation of B under A  54
3.7  A Combination of Two Aggregations  54
3.8  Variable Aggregation of B under A  59
3.9  Chain with Next Pointers Association of B under A  59
3.10  Chain with Next and Owner Pointers Association of B under A  67
3.11  Chain with Next and Prior Pointers Association of B under A  67
3.12  Chain with Next, Prior and Owner Pointers Association of B under A  77
3.13  Pointer Array Association of B under A  77
3.14  Pointer Array with Owner Pointers Association of B under A  81
3.15  Dummy Record Association of A and B  87
3.16  Single Linkage of B under A  93
3.17  Double Linkage of A and B  93
3.18  Multi-Level Record Array and Binary Tree Set Structures  167
5.1  Data Base Description for a Storage Limit of 1.35 M Bytes, PGSZ = 2000 Bytes and MSPG = 2  235
5.2  Data Base Description for a Storage Limit of 1.3 M Bytes, PGSZ = 2000 Bytes and MSPG = 2  240
5.3  Data Base Description for a Storage Limit of 1207050 M Bytes, PGSZ = 2000 Bytes and MSPG = 2  242
5.4  Data Base Description for a Storage Limit of 1227650 M Bytes, PGSZ = 2000 Bytes and MSPG = 2  244
5.5  Data Base Description for a Storage Limit of 1348250 M Bytes, PGSZ = 2000 Bytes and MSPG = 2  246
5.6  Data Base Description for a Storage Limit of 1485350 M Bytes, PGSZ = 2000 Bytes and MSPG = 2  253
5.7  Storage/Time Trade-off Graph  256
5.8  Data Base Description for a Storage Limit of 1194000 M Bytes, PGSZ = 2000 Bytes and MSPG = 2  259
5.9  Trade-off Graph for PTRS = 3  264
5.10  Trade-off Graphs for Different MSPG Values where PGSZ = 2000 Characters  265
5.11  Trade-off Graphs for Different MSPG Values where PGSZ = 4000 Characters  267
5.12  Trade-off Graphs for Different PGSZ Values where MSPG = 1  269
5.13  Trade-off Graphs for Different PGSZ Values where MSPG = 2  270
5.14  Trade-off Graphs for Different PGSZ Values where MSPG = 4  271
5.15  Trade-off Graphs for Different PGSZ Values where MSPG = 5  272
5.16  Trade-off Graphs for Different MSPG and PGSZ Values  273
5.17  Trade-off Graphs for Different PTRS Values where MSPG and PGSZ are fixed  274


LIST OF TABLES

Table  Page

4.1.1  Number of Acceptable Implementations for R_j  180
4.3.1  Maximal Number of Undominated Choices for Stage k  214
5.1  Time Cost Error Estimates for MSPG = 5, PGSZ = 4000  229
5.2  Time Cost Error Estimates for MSPG = 4, PGSZ = 4000  229
5.3  Time Cost Error Estimates for MSPG = 2, PGSZ = 4000  229
5.4  Time Cost Error Estimates for MSPG = 1, PGSZ = 4000  229
5.5  Time Cost Error Estimates for MSPG = 5, PGSZ = 2000  230
5.6  Time Cost Error Estimates for MSPG = 4, PGSZ = 2000  230
5.7  Time Cost Error Estimates for MSPG = 2, PGSZ = 2000  230
5.8  Time Cost Error Estimates for MSPG = 1, PGSZ = 2000  230
5.9  Time Cost Error Estimates for MSPG = 25, PGSZ = 4000  232


LIST OF SYMBOLS

(in the order of their appearance)

Symbol              Description

CHAR                A character set
I_i                 A data item type
SI                  The set of data item types of an application
NI                  Cardinality of SI
ITNM(i)             Name of I_i ∈ SI
ITSZ(i)             Size (number of characters) of I_i ∈ SI
ITCR(i)             Cardinality of I_i ∈ SI
R_j(A,B)            A data base relation between two data item
                    types A ∈ SI and B ∈ SI
a, b                Values of data item types A, B respectively
R_j                 A data base relation without explicit
                    reference to component data item types
SR                  The set of data base relations of an
                    application
NR                  Cardinality of SR
ORG(R_j)            Origin data item type of relation R_j;
                    e.g. if R_j = R_j(A,B) then ORG(R_j) = A
DST(R_j)            Destination data item type of relation R_j;
                    e.g. if R_j = R_j(A,B) then DST(R_j) = B
[illegible]         Cardinality of R_j (number of ordered pairs
                    <a,b> ∈ R_j(A,B), where a ∈ A and b ∈ B)
<m_j, m̄_j, M_j>     An ordered triple representing the minimum,
                    average and maximum number of B-values
                    related to one A-value in R_j, respectively
<n_j, n̄_j, N_j>     An ordered triple representing the minimum,
                    average and maximum number of A-values
                    related to one B-value in R_j, respectively


[illegible]         Cardinalities of data item types
                    A = ORG(R_j) and B = DST(R_j), respectively
[illegible]         Sizes of data item types A = ORG(R_j) and
                    B = DST(R_j), respectively
[illegible]         Number of A(B)-values related to at least
                    one B(A)-value in R_j
[illegible]         Right image of a ∈ A in R_j(A,B), i.e.
                    {b ∈ DST(R_j) | <a,b> ∈ R_j}
[illegible]         Left image of b ∈ B in R_j(A,B), i.e.
                    {a ∈ ORG(R_j) | <a,b> ∈ R_j}
[illegible]         <m_j, m̄_j, M_j>
[illegible]         <n_j, n̄_j, N_j>
IORC(i)             Item origin count; number of R_j ∈ SR
                    such that I_i = ORG(R_j)
IDSC(i)             Item destination count; number of R_j ∈ SR
                    such that I_i = DST(R_j)
SINR                The set of sink relations; i.e.
                    SINR = {R_j ∈ SR | IDSC(i) = 1 and
                    IORC(i) = 0 where I_i = DST(R_j)}
SONR                The set of source relations; i.e.
                    SONR = {R_j ∈ SR | IORC(i) = 1 and
                    IDSC(i) = 0 where I_i = ORG(R_j)}
O                   An operation; op is the operation code of
                    O, B ∈ SI is the operand of O, A ∈ SI is
                    the qualifier of O, R = R(A,B) or
                    R = R(B,A) and f is the frequency of O
[illegible]         The set of all possible operation codes op
RU_k                Run unit number k; a sequence of operations
SU                  The set of run units of an application
[illegible]         Cardinality of SU
[illegible]         Number of operations in RU_k
[illegible]         ℓ-th operation of k-th run unit
RUUR_j              Number of run units that use relation R_j

PSI_k               The set of data items used by the k-th run
                    unit RU_k
PSR_k               The set of relations used by the k-th
                    run unit RU_k
TMSB                Maximum storage bound
PGSZ                Size of a page
MSPG                Number of data pages in main storage
PTRS                Size of a pointer
CNTRS               Size of a counter
PLOKS               Size of a privacy lock
fd(A)               A fixed-length vector of duplicate
                    A-values
vd(B)               A variable-length vector of duplicate
                    B-values
fa(B)               A fixed-length vector of B-values
va(A)               A variable-length vector of A-values
mp, op              Member and owner pointers, respectively
np, pp              Next and prior pointers, respectively
MPL                 An integer, 1 ≤ MPL ≤ 24, representing an
                    implementation number
CC(MPL,c)           Constraint condition number c for
                    implementation MPL
AC(MPL,j)           A variable which is equal to one if
                    implementation MPL is acceptable for
                    relation R_j, zero otherwise
SC(MPL,j)           Storage cost of implementation MPL for
                    relation R_j
MMSB                = Σ (i = 1 to NI) ITSZ(i)*ITCR(i)
OMSB                = TMSB - MMSB
NATD, NATS          Numbers of access types to data items and
                    data base sets, respectively


TLD_j, TLS_j        Total lock sizes for single linkage and
                    association implementations of relation
                    R_j, respectively
TC(MPL,j)           Time cost of implementation MPL for
                    relation R_j
C                   A configuration; C = {<R_j, MPL_j> | j = 1,
                    2, ..., NR; R_j ∈ SR, 1 ≤ MPL_j ≤ 24}
CS(C), CT(C)        Cumulative storage and time costs,
                    respectively, of configuration C
TC(MPL,j,O_ℓ^k)     Time cost of implementation MPL for
                    relation R_j in O_ℓ^k (the ℓ-th operation
                    of the k-th run unit)
p_1, p_2, p_3, p_4  Page fault probabilities (p_1 for direct
                    access, p_2 for next or prior access, p_3
                    for related record access and p_4 for
                    privacy lock access)
PBSZ                Size of the page buffer; PBSZ = PGSZ*MSPG
ARL                 Average record length
DRSZ                Size of a dummy record
IMP(j)              = {MPL | AC(MPL,j) = 1}; the set of
                    acceptable implementations for relation R_j
PC(SSR)             A partial configuration for a subset SSR
                    of SR
OVF_j(s)            Objective (or optimal) value function;
                    optimal j-stage solution to state s
q = <MPL,S,T>       A choice for relation R_j
Q_j                 The maximal set of undominated choices
                    for R_j, ordered in increasing sequence of
                    storage costs
RL(k)               The ordinal number in SR of the relation
                    associated with stage k
Q_k                 = Q_RL(k)
NCHS(k)             Number of elements of Q_k
FMP(i,k)            First coordinate of the i-th element q_i
                    of Q_k


CHAPTER  I 
INTRODUCTION 

1.1  Motivation  and  Objectives 

In recent years, the application of computer systems to store the
operational data of large organizations has become an important
part of such organizations' activities. The benefits of using
computerized data base systems have been widely recognized and
well documented [Date 1975], [Beck 1976], [GUIDE-SHARE 1970],
[Berg 1976]. The most important benefit of an integrated data
base is that it provides centralized control of data. The
advantages of central data control, among others, are that it
reduces redundancies in data storage, facilitates sharing of data
among multiple applications, enables the data base administrator
to enforce standards in data representation, and provides means
to apply security restrictions and maintain data integrity.
Another important advantage of central data control is that it
enables the data base administrator to structure the data base
system so that the overall performance is best for the
organization rather than for any individual application. In
particular, the data base administrator can choose a structure
for data storage that models the requirements of the organization
in terms of data values and relationships and at the same time
can optimize the processing performance of a selected set of
applications whose combined performance is critical to overall
system performance.

The designer of the data base structure has knowledge of the
basic entities about which information has to be stored in the
data base. He also is aware of the associations that relate these
basic entities. These associations, being as important to the
organization's informational needs as the basic entities they
relate, must be represented in the data base. The realization of
this representation may be accomplished by physical contiguity in
storage, by pointers or by any other method. Different methods,
however, result in different storage space and processing time
requirements. It is clear that a large number of data base
designs are available where each has advantages and disadvantages
with respect to a given performance objective and that no
absolutely optimal design exists for a particular set of user
requirements.

The objective of this research is to develop a methodology for
automatic design of the logical structure of a paged data base.
The methodology may be used by a data base designer as a design
tool to investigate the space/time trade-offs of a large number
of logical structures that can model the information content of a
given set of user requirements. The approach is to map a high
level description of user requirements in terms of individual
data items and data base relations into a set of record
structures, record relationships and storage structures. Intra-
and inter-record structures generated by this design method
conform to the specifications published in the Data Base Task
Group (DBTG) report of the Conference on Data Systems Languages
[CODASYL 1971]. The data base is assumed to be accessed in a
paging environment. The mapping will be such that the resulting
data base structure is optimal (over a certain class of DBTG
structures) in the sense that it produces a minimum expected
number of page faults for a pre-specified set of user activities,
subject to storage and security constraints.

The user requirements model presented in Chapter 2 provides the
means for specifying the basic data items and relations between
pairs of data items in a statistical description, namely one that
requires information about names, sizes and number of occurrences
rather than actual values. This model also provides a set of 12
data manipulation operations in terms of which expected
frequencies and types of data retrieval and update (user
activities) may be expressed. Other components of this model
allow for the description of the user's storage space
requirements in terms of parameters such as the total allowable
storage limit and the sizes of pointers, counters, privacy locks
and memory pages. In the following we will refer to a given
specification of all parameters of the user requirements model as
an application.
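As a rough illustration of such a statistical description, the
following Python sketch records names, sizes and occurrence counts
instead of actual values. The class and function names are
hypothetical, and the item EMPNO and all numeric values are invented
for the example; only the quantities ITNM, ITSZ, ITCR, TMSB, PGSZ,
MSPG, MMSB, OMSB and PBSZ are taken from the List of Symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItemType:
    name: str         # ITNM(i)
    size: int         # ITSZ(i), in characters
    cardinality: int  # ITCR(i)


@dataclass(frozen=True)
class StorageParameters:
    tmsb: int  # TMSB, maximum storage bound
    pgsz: int  # PGSZ, size of a page
    mspg: int  # MSPG, number of data pages in main storage


def mmsb(items):
    """MMSB: sum of ITSZ(i) * ITCR(i) over all data item types."""
    return sum(item.size * item.cardinality for item in items)


def omsb(items, params):
    """OMSB = TMSB - MMSB."""
    return params.tmsb - mmsb(items)


def pbsz(params):
    """PBSZ = PGSZ * MSPG, the size of the page buffer."""
    return params.pgsz * params.mspg


# Invented example values (not taken from the report).
items = [DataItemType("EMPNO", 6, 1000), DataItemType("EMPBRT", 8, 900)]
params = StorageParameters(tmsb=20000, pgsz=2000, mspg=2)

print(mmsb(items), omsb(items, params), pbsz(params))  # 13200 6800 4000
```

The point of the sketch is that nothing in it depends on the stored
values themselves: every derived quantity is computed from sizes and
counts alone.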


In Chapter 3 we will define a set of 24 implementation
alternatives which can be used to implement data base relations,
in such a way that the collection of exactly one implementation
for each relation defined in the application determines the
overall structure of the data base. The type of the relation
implementation determines whether the relationship between
component data item types will be realized in the data base by
physical adjacency or by pointers through the use of data base
sets or any other mechanism. Also, in the case where data base
sets are to be used, the type of set implementation technique is
determined by the type of the relation implementation.

A two-component cost function will be developed in Chapter 3 for
relation implementations that are acceptable, namely those that
satisfy certain constraint conditions. Constraint conditions act
as a filter to prevent the selection of unconformable DBTG
structures. The first cost component is the storage cost of a
relation implementation and consists of the storage space
necessary to store data item values, counter and pointer
variables and security locks. The second cost component is called
the time cost of a relation implementation measured in the number
of page faults expected to occur when a certain set of data
manipulation operations is executed. Time cost functions will be
defined in terms of probabilities of a certain number of page
fault categories. Probability of a page fault in each category
will be computed from the variables describing the application.
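The two cost components can be pictured with a small Python sketch.
It is only one plausible reading of the description above: the linear
forms, the function names and all numbers are assumptions made for
illustration, and the actual cost functions are those developed in
Chapter 3. The four page fault categories follow the List of Symbols
(direct access, next or prior access, related record access and
privacy lock access).

```python
def storage_cost(value_chars, pointers, counters, locks, ptrs, cntrs, ploks):
    """Space for item values plus pointer, counter and lock overhead.

    ptrs, cntrs and ploks are the sizes of one pointer, counter and
    privacy lock (PTRS, CNTRS, PLOKS); the additive form is assumed.
    """
    return value_chars + pointers * ptrs + counters * cntrs + locks * ploks


def expected_page_faults(accesses, fault_probability):
    """Sum, over page fault categories, of accesses times fault probability."""
    return sum(n * fault_probability[category] for category, n in accesses.items())


# Invented numbers for a single relation implementation.
space = storage_cost(value_chars=12000, pointers=900, counters=100, locks=10,
                     ptrs=4, cntrs=2, ploks=8)
accesses = {"direct": 10, "next_or_prior": 40,
            "related_record": 5, "privacy_lock": 10}
probability = {"direct": 0.9, "next_or_prior": 0.1,
               "related_record": 0.5, "privacy_lock": 0.2}
faults = expected_page_faults(accesses, probability)

print(space, faults)
```

Under these assumptions each acceptable implementation of a relation
is summarized by one pair (storage cost, time cost), which is the form
the optimization step works with.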

The task of the optimization algorithm is, then, to select
exactly one implementation for each relation such that the total
time cost is minimized, total storage cost is bounded by some
given value and certain constraint conditions are satisfied. This
algorithm is presented in Chapter 4 and it treats the
optimization problem as a discrete, multi-stage decision process.
For each relation defined in the application, acceptable
implementations and their storage and time costs are first
determined using the models of Chapter 3. Then a sequence of
"undominated choices" for each relation is generated. If, for any
relation, this sequence consists of only one choice then the
relation is excluded from further consideration; the optimal
choice of implementation for that relation is fixed and known.
The next phase of the decision process, then, treats each
remaining relation (relations with two or more undominated
choices) as one decision "stage". In stage k of the second phase,
a sequence of "undominated solutions" to stage (k+1) is generated
from the sequence of undominated solutions to stage k and
undominated choices of stage k+1. The sequence of undominated
solutions to the first stage is identical to the sequence of
undominated choices of stage 1, and the undominated solutions to
the final stage give a spectrum of solutions where each one is
optimal over a certain range of storage space limits.
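The two-phase procedure can be sketched in Python as follows. This is
a minimal illustration, not the algorithm of Chapter 4: it assumes
that a choice is dominated when some other choice needs no more
storage and no more time, it folds relations that have a single
undominated choice into the starting solution, and the relation
names, implementation numbers and costs are invented.

```python
def undominated(candidates):
    """Keep (storage, time, payload) triples that no other triple dominates.

    The result is ordered by increasing storage and strictly decreasing time.
    """
    frontier = []
    for storage, time, payload in sorted(candidates, key=lambda c: (c[0], c[1])):
        if not frontier or time < frontier[-1][1]:
            frontier.append((storage, time, payload))
    return frontier


def design(relations):
    """relations maps a relation name to (mpl, storage, time) triples."""
    start_storage, start_time, start_config = 0, 0.0, {}
    stages = []
    for name, implementations in relations.items():
        choices = undominated([(s, t, mpl) for mpl, s, t in implementations])
        if len(choices) == 1:
            # A single undominated choice: this implementation is fixed.
            s, t, mpl = choices[0]
            start_storage, start_time = start_storage + s, start_time + t
            start_config[name] = mpl
        else:
            stages.append((name, choices))

    solutions = [(start_storage, start_time, start_config)]
    for name, choices in stages:
        # Combine every undominated solution so far with every choice
        # of the next stage, then discard the dominated combinations.
        combined = [
            (s + cs, t + ct, {**config, name: mpl})
            for s, t, config in solutions
            for cs, ct, mpl in choices
        ]
        solutions = undominated(combined)
    return solutions


def best_within(solutions, storage_limit):
    """Lowest-time solution whose storage does not exceed the limit."""
    feasible = [sol for sol in solutions if sol[0] <= storage_limit]
    return feasible[-1] if feasible else None


# Invented example: (implementation number, storage cost, time cost).
relations = {
    "R1": [(1, 100, 50.0), (2, 150, 30.0), (3, 160, 40.0)],
    "R2": [(5, 80, 20.0), (7, 120, 5.0)],
    "R3": [(4, 60, 10.0), (9, 90, 12.0)],
}
solutions = design(relations)
for storage, time, config in solutions:
    print(storage, time, config)
```

Each entry of the final list is the best available design for storage
limits between its own storage cost and that of the next entry, which
corresponds to the spectrum of solutions described above;
`best_within(solutions, 300)` selects the entry for a limit of 300.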




Results obtained by applying our data base design methodology to
an example application introduced in Chapter 2 will be discussed
in Chapter 5. This chapter includes the results of a sensitivity
analysis for time cost functions with respect to certain
parameters. A number of data base designs, each optimal under a
particular set of conditions, will be discussed and their
storage/time trade-offs will be shown. Finally, the effect of
varying some of the important parameters of the model on the
resulting data base designs will also be discussed in Chapter 5.

Other methodologies for the design of DBTG data structures have
been reported by Gerritsen [1975], Mitoma [1975], Bubenko [1976]
and others. While the main goals of these research efforts are
the same, namely the design of DBTG data structures, they have
substantial differences in the modelling of the real problem, in
their approach in arriving at a solution and, most importantly,
in their performance objectives. Gerritsen [1976] developed an
automatic data base "designer" capable of designing a
"satisfactory" data base structure for a set of anticipated
queries. These queries constitute the input to the designer
system. A series of assertions are produced from the input and
then a set of records and record relationships is constructed
such that all of the assertions are satisfied. The possibility
that several alternative designs may satisfy a given set of
assertions is not considered and, consequently, criteria for
selection of one data structure design from among several
candidates are not presented in [Gerritsen 1976].

Bubenko [1976] takes a different view of data base schema design
and proposes the design of two extreme alternatives followed by
successive refinements and evaluations to arrive at a compromise.
At one extreme, data will be stored at the initial (transaction)
level and answers to queries are derived when necessary. This
extreme minimizes storage space requirements at the expense of
long response time. At the other extreme, information is
organized such that the response to every anticipated query is
explicitly stored and modified with every transaction that
affects it. In the evaluations of alternative solutions, factors
such as secondary storage space requirements, number of data base
accesses for updates and number of data base accesses for
answering queries are considered. A certain degree of redundancy
in data storage is to be expected in the data base designs
produced by this methodology. The degree of this redundancy
depends on the proximity of the final solution to the two
extremes. This is because information in one record occurrence
can "be derived from information in record occurrences to which
it is an owner".

Mitoma [1975] gives a methodology for data base schema design.
The performance objective employed in that work is the number of
record accesses required over a period of time to process an
anticipated set of operations. Record and set definitions are
designed so as to minimize the number of record accesses subject
to storage and feasibility constraints. For data base set types
generated by this design methodology, specifications for LINKED
TO OWNER and PRIOR PROCESSABLE are determined by the design,
although the specific techniques employed to implement data base
sets are not considered. The effect of different alternative
techniques for the implementation of data base sets on the
performance of data base systems has been recognized, for
example, in [Bachman 1974].

1.2  DBTG  Proposal 

The CODASYL Data Base Task Group specifications for data base
management systems, published in [CODASYL 1971], constitute a
major development in data base technology. Concepts presented by
these specifications have been used in several commercially
available data base management systems including DMS1100 [UNIVAC
1974], DBMS-10 [DEC], IDMS [Cullinane] and IDS [Honeywell], and
are expected to influence future national standards in data base
technology. The DBTG proposal is composed of specifications for
three languages: a schema Data Definition Language (schema DDL),
a sub-schema Data Definition Language (sub-schema DDL) intended
for use with COBOL and a Data Manipulation Language (DML). The
COBOL and non-COBOL portions of the original DBTG report were
later extended and modified by the Data Base Language Task Group
(DBLTG) and the Data Definition Language Committee (DDLC),
respectively. The CODASYL COBOL Data Base Facility Proposal
[CODASYL 1973a] is DBLTG's report and consists of a detailed
proposal for a COBOL DML and sub-schema DDL which is basically
the same as the DBTG proposal [CODASYL 1971] with some
differences in syntactic details. DDLC's report appeared in June
1973 as the CODASYL DDLC Journal of Development [CODASYL 1973b].
As stated in the report, "the DDLC limited its efforts primarily
to clarification of, rather than extension to, the base
document."

In  this  section  we  will  briefly  review  those  components 
of  the  DBTG  data  definition  model  that  are  relevant  to  our 
work.  The  reader  is  referred  to  [CODASYL  1971],  [CODASYL 
1973a]  and  [CODASYL  1973b]  for  a more  complete  description 
and  to  [Date  1975]  and  [Taylor  1976]  for  a tutorial  treat- 
ment of  the  underlying  concepts.  The  major  DBTG  data  defi- 
nition concepts  that  are  relevant  to  our  work  are  data 
items,  data  aggregates  (vectors  and  repeating  groups), 
records,  data  base  sets  and  privacy  locks. 

A data item is the smallest unit of named data. An
occurrence of a data item is a contiguous sequence of
storage locations that contains the value of the data item.
For example, an occurrence of a data item type named EMPBRT
(employee birth) and of size 8 may be a sequence of 8
storage locations containing "06/29/56". Data items can
also take on null values or pointer (data-base-key) values.

Data  aggregates  are  collections  of  data  items  within 
a record  that  can  be  referenced  as  a group  using  the  name 
of  the  data  aggregate.  There  are  two  kinds  of  data  aggre- 
gates: vectors  and  repeating  groups. 

A vector  is  a one-dimensional  array  of  data  items  of 
the  same  data  item  type,  similar  to  a FORTRAN  array.  Vec- 
tors appearing  at  the  first  level  of  a record  definition 
can  be  of  variable-length,  where  the  number  of  elements  of 
the  vector  may  be  specified  by  the  value  of  a counter  data 
item  in  the  same  record.  A fixed-length  vector  may  be 
defined  at  any  level  of  record  definition  and  the  number 
of  elements  of  the  vector  is  fixed  at  a value  specified  at 
data  base  design  time. 

A repeating  group  is  a collection  of  data  occurring 
repeatedly  in  a record.  A repeating  group  may  contain 
data  items,  vectors  or  repeating  groups.  This  provides 
the  capability  of  defining  nested  repeating  groups.  The 
number  of  occurrences  of  a repeating  group  (in  a record) 
can  be  specified  by  the  value  of  a (counter)  data  item  in 
the  record  if  this  number  is  variable.  However,  just  as 
in  the  case  of  vectors,  variably  dimensioned  repeating 
groups can occur only in level one of a record definition.
In  other  words,  a variably  dimensioned  repeating  group 
cannot  be  defined  as  part  of  another  repeating  group. 

A record is the unit of access in a DBTG system. It
can consist of data items, vectors and repeating groups of
fixed or variable length. Records with identical structure
are grouped into record types. The number of records of
a given type need not be explicitly specified anywhere in
the data base, in contrast to repeating groups, for which
the number of occurrences in the containing repeating
group or record is given either at data definition time
(fixed-length) or as the value of a data item in the
containing record (variable-length). Records are uniquely
identifiable by their data-base-keys. A record may be
directly accessed if its data-base-key is known, whereas a
repeating group is accessible only when the record
containing it is available to the executing program. Record
types may be related to each other, using data base sets,
to implement complex network-type associations. However,
within a record only hierarchical associations can be
accommodated (using repeating groups).

A data base set is a collection of related records.
One of the records in the set is called the owner, and
the remaining (zero or more) records are called member
records of the set. The existence of a set is established
by the existence of its owner record and it is "empty" if
it does not contain any member records. Owner and member
records of the same set cannot be of the same type. Sets
of identical structure are grouped into set types. A
record type may be declared as member record type in many
set types and at the same time as owner record type in many
other set types. However, a record (occurrence) cannot
belong to (participate as owner or member in) more than
one set (occurrence) of a given type. It is evident from
the foregoing that a data base set type can only implement
one-to-many relationships between occurrences of its owner
and member record types. A set type may be declared with
more than one member record type.
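
The membership rules above can be made concrete with a small sketch. It is only an illustrative model and not DBTG syntax: the class names SetType and SetOccurrence, the connect method and the record identifiers are hypothetical, and records are assumed to carry identifiers that are unique across record types (much like data-base-keys).

```python
from dataclasses import dataclass, field


@dataclass
class SetType:
    """A set type: one owner record type, one or more member record types."""
    name: str
    owner_type: str
    member_types: tuple

    def __post_init__(self):
        # Owner and member records of the same set cannot be of the same type.
        if self.owner_type in self.member_types:
            raise ValueError("owner and member record types must differ")
        # Identifiers of records already taking part (as owner or member)
        # in some occurrence of this set type.
        self._participants = set()

    def _claim(self, record_id):
        # A record occurrence may take part in at most one set occurrence
        # of a given type.
        if record_id in self._participants:
            raise ValueError(f"{record_id} already belongs to a {self.name} set")
        self._participants.add(record_id)

    def new_occurrence(self, owner_id):
        """A set occurrence exists as soon as its owner record exists."""
        self._claim(owner_id)
        return SetOccurrence(self, owner_id)


@dataclass
class SetOccurrence:
    set_type: SetType
    owner_id: str
    members: list = field(default_factory=list)

    def connect(self, member_id):
        self.set_type._claim(member_id)
        self.members.append(member_id)

    def is_empty(self):
        return not self.members
```

Because each member can be connected to only one owner within a set type, the structure can express one-to-many relationships only, which is the point made in the paragraph above.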

Protection of data is achieved through a system of
locks and keys. Privacy locks may be declared in the
schema for different types of accesses to sets, records,
data items, etc. Run units seeking access to data should
provide privacy keys (at execution time) and each key is
simply compared with the corresponding lock for a match.
Multiple locks may be specified for a given pair (access
type, resource). A request for that pair is granted if
the key provided by the run unit matches any one of the
locks. Privacy locks and keys may be defined as procedures
(or results of procedures) that provide the value of a key
or take some action.
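
A minimal sketch of the lock-and-key comparison follows. The function name access_granted and the dictionary layout are assumptions made for illustration. Treating a procedural lock as a callable that returns a verdict, and refusing access when no lock is declared for a pair, are simplifications of this sketch rather than statements of the DBTG rules.

```python
def access_granted(locks, access_type, resource, key):
    """Return True if `key` matches any lock declared for the pair.

    `locks` maps (access_type, resource) to a list of locks. A lock is
    either a literal value or a procedure (callable) applied to the key.
    """
    # Assumption of this sketch: an undeclared pair is refused.
    declared = locks.get((access_type, resource), [])
    for lock in declared:
        if callable(lock):
            if lock(key):          # procedural lock decides for itself
                return True
        elif lock == key:          # literal lock: simple comparison
            return True
    return False


# Hypothetical schema fragment: two literal locks on one pair.
example_locks = {("GET", "EMPLOYEE"): ["alpha", "beta"]}
```

With these locks, a run unit presenting the key "beta" for ("GET", "EMPLOYEE") is granted access because one of the two locks matches, while the key "gamma" is refused.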

A run unit is an execution of one or more programs,
written in some procedural language (host language)
augmented to support a set of Data Manipulation Language
(DML) commands. A complete description of a data base in
terms of DDL entries is the schema for the data base. It
includes definitions of all record types and their
component data items and data aggregates, set types,
privacy locks, etc.

A sub-schema is a "consistent and logical" subset of the
schema from which it is derived. The concept of sub-schema
is employed to separate the description of the entire data
base, as known to the data base administrator, from the
description of portions of it, as he wishes them to be
known to individual programs. A sub-schema is invoked by a
run unit using a DML command. This invocation causes the
automatic definition of a User Working Area (UWA) for the
run unit. A location is provided in the UWA for each data
item included in the sub-schema. Each such data item may be
referenced by the program using its name as declared in the
sub-schema, which may be different from the one declared in
the schema.

In  our  data  base  design  descriptions  discussed  in 
Chapter  5 we  will  not  give  the  exact  schema  for  the  data 
base,  namely  a syntactically  correct  set  of  record  and  set 
declarations  written  in  DDL  statements.  The  data  base 
design  descriptions  will,  however,  consist  of  the  following 
information  from  which  a designer  can  easily  derive  all 
schema  definitions  concerning  the  logical  structure  of 
records  and  sets. 

All  record  types  of  the  data  base  are  determined  and 
for  each  record  type  all — if  any — data  item  types,  vectors 
and repeating groups contained in it are determined. Size
of  each  data  item  type,  number  of  occurrences  of  each 
repeating  group  and  number  of  elements  of  each  vector  are 
given  as  a constant  (for  fixed-lengths)  or  as  the  value  of 
another  data  item  type  (for  variable-lengths). 

All  set  types  of  the  data  base  are  determined  and  for 
each  set  type  its  owner  record  type  and  its  member  record 
type  are  given.  For  each  set  type,  it  is  also  specified 
how  occurrences  of  sets  of  that  type  should  be  implemented. 
Set implementation techniques considered are: chain, chain
with owner pointers, chain with prior pointers, chain with
owner and prior pointers, pointer array and pointer array
with owner pointers.
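
The contents of such a design description can be pictured as a small data structure. The sketch below is illustrative only: the class and field names are hypothetical, and the example record types and set type are invented. Only the six implementation techniques are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class SetImplementation(Enum):
    """The six set implementation techniques listed in the text."""
    CHAIN = "chain"
    CHAIN_OWNER = "chain with owner pointers"
    CHAIN_PRIOR = "chain with prior pointers"
    CHAIN_OWNER_PRIOR = "chain with owner and prior pointers"
    POINTER_ARRAY = "pointer array"
    POINTER_ARRAY_OWNER = "pointer array with owner pointers"


@dataclass(frozen=True)
class SetTypeDesign:
    """One set type of a design description (hypothetical layout)."""
    name: str
    owner_record_type: str
    member_record_type: str
    implementation: SetImplementation


# Invented example: a set type relating two record types.
example_design = SetTypeDesign(
    name="ORG-JOBS",
    owner_record_type="ORGANIZATION",
    member_record_type="JOB",
    implementation=SetImplementation.CHAIN_OWNER,
)
```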

In  this  data  definition  design  we  are  primarily  con- 
cerned with  the  logical  record  and  set  structures.  We  are 
not  attempting  to  model  and  design  all  of  the  features 
included  in  a DBTG  data  model.  In  particular,  the  following 
non-structural  data  model  components  will  not  be  modelled 
and  designed. 

  i)   Groupings of records in AREAS (REALMS).

  ii)  Access method specifications such as SINGULAR SET
       declarations, INDEX and SEARCH KEY declarations and
       CALC KEY definitions.

  iii) Data sub-model (sub-schema) specifications.

  iv)  LOCATION MODE specifications for records.

  v)   Some data item characteristics such as VIRTUAL or
       ACTUAL RESULT or SOURCE specifications,
       ENCODING/DECODING and validity- or error-checking
       (CHECK and ON clauses).

  vi)  Some data base set characteristics such as membership
       classes (MANDATORY/OPTIONAL or AUTOMATIC/MANUAL) and
       SET SELECTION criteria.

Further limitations and simplifying assumptions made
for this work will be explained where those assumptions are
used; e.g., assumptions regarding the page fault probability
model are given in Section 3.4.1. Some of the constraint
conditions of Section 3.2 are only sufficient (rather than
both necessary and sufficient). The necessity and effects of
adopting those conditions are discussed in Section 3.2. Finally,
some  areas  of  extension  to  the  implementation  alternatives 
and  reasons  for  not  incorporating  them  into  our  model  are 
discussed  at  the  end  of  Chapter  3. 


CHAPTER  II 

USER  REQUIREMENTS  MODEL 

In this chapter we will formally define a model capable
of describing the informational requirements of an
application at a level where the ultimate record and set
structures of the data base are not known. It is in terms
of the parameters of this model that the data base designer
specifies an application. The implementation models of
Chapter 3 will then use this information to analyze
different implementation alternatives by associating storage
and time costs with each alternative. The optimization
algorithm of Chapter 4 then makes all the record and set
structuring decisions to develop an optimized data base
design.

In  Section  2.1  we  will  discuss  the  first  component  of 
the  user  requirements  model,  namely  the  concepts  of  data 
item  types  and  data  item  values.  Next  in  Section  2.2  we 
will  discuss  data  base  relations.  In  Section  2.3  we  will 
discuss  the  component  of  the  model  through  which  data 
access  forms  and  frequencies  are  specified  in  terms  of  a 
certain number of operations using data base relations.
The last component of the user requirements model concerns
the storage space requirements discussed in Section 2.4.

2.1  Data  Item  Types  and  Data  Item  Values 

A data item value is a string of characters chosen
from a given character set with a known ordering. The
character set may be, for example, the set of EBCDIC
characters with the conventional ordering. The size (or
length) of a data item value is the number of characters it
contains. For example, if $CHAR = \{a_1, a_2, \ldots, a_m\}$ is a
character set ordered such that $a_j < a_{j+1}$, $j = 1, \ldots, m-1$,
then

    $d_i = a_{i_1} a_{i_2} \cdots a_{i_s}$, where $a_{i_j} \in CHAR$, $j = 1, \ldots, s$

is a data item value of size s. A named set I of data item
values $d_i$ of equal size is called a data item type. The
cardinality of this set is the cardinality of the data item
type.

    $I = \{d_i \mid d_i$ is a data item value of size $\alpha''$, $i = 1, 2, \ldots, \alpha'\}$
    $\alpha''$ is the size of I
    $\alpha'$ is the cardinality of I ($|I|$)

Data items $d_i$, $i = 1, \ldots, \alpha'$, of the same type I are
ordered in the conventional way, i.e., if

    $d_i = a_i^1 a_i^2 \cdots a_i^{\alpha''}$ and $d_j = a_j^1 a_j^2 \cdots a_j^{\alpha''}$

then $d_i < d_j$ iff for some $k \in \{1, 2, \ldots, \alpha''\}$, $a_i^k < a_j^k$ and
$a_i^p = a_j^p$ for all $p < k$.
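
The ordering just defined is the usual lexicographic order driven by the ordering of the character set. A minimal sketch, assuming the character set is passed as a string whose characters appear in ascending order (the function name less_than is hypothetical):

```python
def less_than(d_i, d_j, alphabet):
    """Return True if d_i < d_j under the ordering given by `alphabet`.

    `alphabet` lists the characters of the character set in ascending
    order; d_i and d_j are values of the same data item type and hence
    of equal size.
    """
    if len(d_i) != len(d_j):
        raise ValueError("values of one data item type have equal size")
    rank = {ch: pos for pos, ch in enumerate(alphabet)}
    for a_i, a_j in zip(d_i, d_j):
        if a_i != a_j:
            # The first position k at which the values differ decides.
            return rank[a_i] < rank[a_j]
    return False  # identical values: neither is less than the other
```

For the alphabet "ABCD", the value "ABD" precedes "ACA" because the two agree in position 1 and B precedes C in position 2.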

The set of all data item types of an application is
denoted by $SI = \{I_1, I_2, \ldots, I_{NI}\}$. The following
variables characterize each data item type $I_i \in SI$,
$i = 1, 2, \ldots, NI$.

    ITNM(i)   /* Name of the data item type I_i */

    ITSZ(i)   /* Size of I_i */

    ITCR(i)   /* Cardinality of I_i */

Figure 2.1 illustrates an example set of NI = 24 data
item types of an application. The sixteenth data item type
$I_{16}$ is, for example, called EMPADR and consists of 5000
employee addresses each of size 100 characters. The tenth
data item type $I_{10}$ is called JOBTTL and consists of 500
job titles each of size 60 characters, and so on. The
exact values of data items, for example the 500 job titles,
have no significance in our optimization problem and
therefore they need not be specified.

     i    ITEM NAME    SIZE    CARDINALITY
    --    ---------    ----    -----------
     1    ORGCOD         10             50
     2    ORGNAM         30             50
     3    BUDGET          6             50
     4    JOBCOD         10            500
     5    ATHQNT          6            500
     6    ATHSAL          6           1000
     7    ATHMIN          6           1000
     8    ATHMAX          6           1000
     9    DDUCTN         15           2000
    10    JOBTTL         60            500
    11    EMPNUM         25           5000
    12    EMPNAM         30           5000
    13    JHSYRS          2             30
    14    BRTDAT          8           5000
    15    EMPLVL          2             20
    16    EMPADR        100           5000
    17    EMPMST          2             10
    18    OFRCOD         25            100
    19    OFRFMT          3             10
    20    OFRDAT          8             50
    21    OFRLOC         10             50
    22    COUNUM         15             50
    23    COUTTL         60             50
    24    COUDSP        250             50

    Figure 2.1  An Example Set SI of 24 Data Items $I_i$
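
The three descriptors ITNM, ITSZ and ITCR amount to a lookup over a table such as Figure 2.1. The sketch below simply transcribes the figure into a Python list. The list name ITEM_TYPES and the choice to expose the descriptors as functions with 1-based indices are assumptions of this illustration.

```python
# (name, size in characters, cardinality) for I_1 ... I_24, from Figure 2.1.
ITEM_TYPES = [
    ("ORGCOD", 10, 50), ("ORGNAM", 30, 50), ("BUDGET", 6, 50),
    ("JOBCOD", 10, 500), ("ATHQNT", 6, 500), ("ATHSAL", 6, 1000),
    ("ATHMIN", 6, 1000), ("ATHMAX", 6, 1000), ("DDUCTN", 15, 2000),
    ("JOBTTL", 60, 500), ("EMPNUM", 25, 5000), ("EMPNAM", 30, 5000),
    ("JHSYRS", 2, 30), ("BRTDAT", 8, 5000), ("EMPLVL", 2, 20),
    ("EMPADR", 100, 5000), ("EMPMST", 2, 10), ("OFRCOD", 25, 100),
    ("OFRFMT", 3, 10), ("OFRDAT", 8, 50), ("OFRLOC", 10, 50),
    ("COUNUM", 15, 50), ("COUTTL", 60, 50), ("COUDSP", 250, 50),
]

NI = len(ITEM_TYPES)  # number of data item types


def ITNM(i):
    """Name of the data item type I_i (i is 1-based, as in the text)."""
    return ITEM_TYPES[i - 1][0]


def ITSZ(i):
    """Size of I_i in characters."""
    return ITEM_TYPES[i - 1][1]


def ITCR(i):
    """Cardinality of I_i."""
    return ITEM_TYPES[i - 1][2]
```

Note that only sizes and cardinalities are recorded, and no actual values, which matches the remark that the values themselves play no part in the optimization problem.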

2.2  Data  Base  Relations 

Associations  between  different  pairs  of  values  of  data 
items  are  modelled  by  the  concept  of  a data  base  relation. 
If A and B are two data item types in SI then a relation
R(A, B) over A and B is a named set of ordered pairs <a, b>
where a and b are data item values of data item types A
and B, respectively. We call A and B the origin and
destination data item types of relation R, respectively,
and denote them as follows.

    A = ORG(R),  B = DST(R)


In set theoretic terms R(A, B) is a subset of the
cartesian product A x B. For our purpose the specification
of a data base relation $R_j$ over data item types A and B
consists of the following, where SR is the set of all data
base relations of an application and NR = |SR|.

    RNAME = RNAM(j)
        is the name of the relation $R_j$

    A = ORG($R_j$)
        is the origin data item type of $R_j$

    B = DST($R_j$)
        is the destination data item type of $R_j$

    $r_j$
        is the number of ordered pairs in $R_j$ and is called
        the cardinality of the relation $R_j$

    $\langle m_j, \bar{m}_j, M_j \rangle$
        is an ordered triple representing the minimum,
        average and maximum number of B-values related to
        one A-value in $R_j$, respectively, and the average is
        taken over the A-values related to at least one
        B-value in $R_j$

    $\langle n_j, \bar{n}_j, N_j \rangle$
        is an ordered triple representing the minimum,
        average and maximum number of A-values related to
        one B-value in $R_j$, respectively, and the average is
        taken over the B-values related to at least one
        A-value in $R_j$.

We will use the notation $R_j(A, B)$ (instead of simply
$R_j$) to refer to the j-th relation in SR wherever it is
necessary to identify the origin and destination data item
types of the relation. In such cases we will use the six
sets of variables $\alpha_j$, $\alpha'_j$, $\alpha''_j$, $\beta_j$, $\beta'_j$ and $\beta''_j$ with the
following definitions.

    $\alpha'_j$, $\beta'_j$
        are the cardinalities of the data item types
        A = ORG($R_j$) and B = DST($R_j$), respectively.

    $\alpha_j$, $\beta_j$
        is the number of A (B)-values related to at least
        one B (A)-value in $R_j$.

    $\alpha''_j$, $\beta''_j$
        are the sizes of data item types A and B,
        respectively.


It is clear from the above definitions that $\bar{m}_j$ and $\bar{n}_j$
are given by the following.

    $\bar{m}_j = r_j / \alpha_j$,    $\bar{n}_j = r_j / \beta_j$

Furthermore, if $m_j \neq 0$ then every A-value is related to at
least one B-value in $R_j$ and therefore $\alpha_j = \alpha'_j$. Similarly
$n_j \neq 0$ implies $\beta_j = \beta'_j$.
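
A short sketch shows how a multiplicity triple follows from the pairs of a relation. The function name multiplicity and the sample pairs are invented for illustration. The minimum and maximum range over every value of the data item type, and the average is taken only over values that occur in at least one pair, as stated in the definitions.

```python
from collections import Counter


def multiplicity(pairs, all_a_values):
    """Return (min, average, max) number of B-values per A-value.

    `pairs` is an iterable of (a, b) tuples and `all_a_values` lists every
    value of the origin data item type, including unrelated ones.
    """
    pairs = set(pairs)
    counts = Counter(a for a, _ in pairs)
    per_value = [counts.get(a, 0) for a in all_a_values]
    related = len(counts)  # A-values related to at least one B-value
    average = len(pairs) / related if related else 0.0
    return min(per_value), average, max(per_value)


# Invented example: course c3 has no offering.
sample_pairs = {("c1", "o1"), ("c1", "o2"), ("c2", "o3")}
sample_courses = ["c1", "c2", "c3"]
```

For this sample the triple is (0, 1.5, 2): the minimum is zero because of c3, while the average 3/2 divides the three pairs among the two related courses. Swapping the components of each pair gives the second triple of the same relation.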

For every element $a \in ORG(R_j)$ the set of data item
values $\{b \in DST(R_j) \mid \langle a, b \rangle \in R_j\}$ is called the right
image of a in $R_j$, denoted by $RIM_j(a)$. Elements of $RIM_j(a)$
are assumed to be ordered with the same ordering as
defined for data item values of type B, for all $a \in A$, so
that we can refer to the i-th B-value related to an A-value
in relation $R_j(A, B)$. Similarly the left image of
$b \in DST(R_j)$ in $R_j$ is defined as:

    $LIM_j(b) = \{a \in ORG(R_j) \mid \langle a, b \rangle \in R_j\}$,

which is again an ordered set of A-values. Figure 2.2
illustrates some of the above definitions for a simple
example relation

    $R(A, B) = \{\langle a_1, b_1 \rangle, \langle a_2, b_1 \rangle, \langle a_2, b_2 \rangle, \langle a_2, b_3 \rangle,$
               $\langle a_3, b_3 \rangle, \langle a_3, b_4 \rangle, \langle a_4, b_3 \rangle, \langle a_5, b_3 \rangle\}$.

    $\alpha' = |A| = 8$,  $\beta' = |B| = 4$

    Figure 2.2  An Example Relation R(A, B)
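
The right and left images can be computed directly from the set of pairs. The sketch below uses the pairs of the example relation as far as they can be read from the text. The subscripts of the first two pairs are partly illegible in the source, so those two entries are an assumption. The function names and the string encoding of the values are hypothetical.

```python
def right_image(relation, a):
    """Ordered list of B-values related to the A-value `a`."""
    return sorted(b for (x, b) in relation if x == a)


def left_image(relation, b):
    """Ordered list of A-values related to the B-value `b`."""
    return sorted(a for (a, y) in relation if y == b)


# Pairs of the example relation R(A, B); the first two are assumed.
R = {
    ("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("a2", "b3"),
    ("a3", "b3"), ("a3", "b4"), ("a4", "b3"), ("a5", "b3"),
}

print(right_image(R, "a3"))  # ['b3', 'b4']
print(left_image(R, "b3"))   # ['a2', 'a3', 'a4', 'a5']
```

Sorting the result reflects the assumption that images inherit the ordering of the data item values, so that "the i-th B-value related to a" is well defined.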

The set of all data base relations for a data management
application is denoted by SR and its cardinality is
denoted by NR:

    $SR = \{R_1, R_2, \ldots, R_{NR}\}$.

Figure 2.3 shows an example of 26 data base relations
defined over the 24 data item types shown in Figure 2.1.
We make the assumption that each data item value of every
data item type $I_i \in SI$ is related to at least one data item
value in some relation $R_j \in SR$. In other words, for all
$I_i \in SI$ and for all $a \in I_i$ there exist $R_j \in SR$ and b such
that either $\langle a, b \rangle \in R_j$ or $\langle b, a \rangle \in R_j$.

In Figure 2.3 the 26th relation $R_{26}$, for example, is
called OFROFCOU. This relation is intended to associate
offering codes OFRCOD of all offerings of each course to
its course number COUNUM. The origin and destination data
item types of this relation are called COUNUM and OFRCOD,
respectively. There are 100 ordered pairs <counum, ofrcod>
in $R_{26}$ where counum and ofrcod are data item values of data
item types called COUNUM and OFRCOD, respectively. The
first multiplicity vector $\langle m_{26}, \bar{m}_{26}, M_{26} \rangle$ is equal to
<0, 2, 10> which means that a course has at least zero and
at most 10 offerings and that on the average it has two
offerings. As another example, relation $R_{23}$ called STUDENTS
associates employee numbers of employees enrolled in an
offering (of a course) to its offering code. There are 2100
pairs in $R_{23}$. At least [...], at most 50 and on the average 21
students (employees) are enrolled in each offering. Some
employees are not enrolled in any offering ($n_{23} = 0$);
however, among those enrolled each student is enrolled in an
average of $\bar{n}_{23} = 2$ and a maximum of $N_{23} = 5$ offerings. This
is an example of a many-to-many relation, while the relation
called NAMOFORG is an example of a 1-to-1 relation.

In order to facilitate the development of our constraint
conditions in Chapter 3 we will now classify relations
$R_j \in SR$, $j = 1, 2, \ldots, NR$, into eight types numbered
0 through 7, based on their multiplicity vectors
$RM12(j) = \langle m_j, \bar{m}_j, M_j \rangle$ and $RM21(j) = \langle n_j, \bar{n}_j, N_j \rangle$.

Relation $R_j(A, B) \in SR$ is of type zero if
RM12(j) = RM21(j) = <1, 1, 1>. This is a one-to-one
relationship between all A-values and all B-values. In set
theoretic terms $R_j$ in this case is an invertible function
from A to B. It is obvious that if $R_j$ is a type zero
relation then $r_j = \alpha_j = \beta_j = \beta'_j = \alpha'_j$.

    Figure 2.3  An Example Set SR of 26 Data Base Relations

Relation $R_j(A, B)$ is of type 1 if RM21(j) = <1, 1, 1>
and $m_j = M_j \neq 1$. This type of relation over A and B is
one in which each and every element of A is related to
exactly $M_j$ B-values while at the same time each and every
B-value is related to exactly one A-value. It is clear
that $\bar{m}_j = M_j$ because $m_j = M_j$ and also that a relation
cannot be both of type 1 and of type 0 because $M_j \neq 1$ is
assumed. A type 2 relation is defined somewhat similarly
to a type 1 relation as follows. Relation $R_j(A, B)$ is
of type 2 if RM12(j) = <1, 1, 1> and $n_j = N_j \neq 1$. Type 1
and type 2 relations are special cases of functions from
B to A and from A to B, respectively, in set theoretic
terminology.

Relation $R_j(A, B)$ is of type 3 if RM21(j) = <1, 1, 1>
and $m_j \neq M_j$. The difference between a type 3 and a type 1
relation is that in a type 1 relation each A-value is
related to exactly $M_j$ B-values whereas in a type 3
relation it may be related to any number of B-values from
$m_j$ to $M_j$, where $m_j$ may even be zero. Again, by requiring
$m_j \neq M_j$ for type 3 we prevent a type 1 (and type zero)
relation from also being classified as a type 3 relation.
Type 4 relations are similarly defined as follows. Relation
$R_j(A, B)$ is of type 4 if RM12(j) = <1, 1, 1> and $n_j \neq N_j$.
Again, type 3 and type 4 relations are functions from B to
A and from A to B, respectively.

Relation $R_j(A, B)$ is of type 5 if RM21(j) = <0, 1, 1>
and $m_j \neq M_j$. Relations of this type are similar to
relations of type 3 except that in this case some B-values
are related to no A-value in $R_j$ because $n_j = 0$ is assumed.
A-values may be related to a variable number (including
zero) of B-values. Type 6 relations are similarly defined
as follows. Relation $R_j(A, B)$ is of type 6 if
RM12(j) = <0, 1, 1> and $n_j \neq N_j$. Finally, if the type of
relation $R_j(A, B)$ is not zero through 6 then it is defined
to be of type 7. A relation of type 7 is a many-to-many
relation and either or both of $m_j$ and $n_j$ may be zero. In
the  example  set  of  data  base  relations  of  Figure  2.3,  we 
observe that relation R2 is of type zero, Rg is of type 1,
R2q  is  of  type  2,  R^  is  of  type  3,  R^2  is  of  type  4,  R^  is 
of  type  5 and  R ^ is  of  type  7. 
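
The eight-way classification can be summarized as a single decision function. This is a sketch under the assumption that each multiplicity vector is given as a (minimum, average, maximum) tuple. The function name relation_type and the test vectors below are hypothetical. Only the vector <0, 2, 10> is quoted from the text, and it is paired here with invented second vectors.

```python
ONE = (1, 1, 1)


def relation_type(rm12, rm21):
    """Classify a relation as type 0..7 from its two multiplicity vectors.

    rm12 = (m_min, m_avg, m_max): B-values per A-value.
    rm21 = (n_min, n_avg, n_max): A-values per B-value.
    """
    m_min, _, m_max = rm12
    n_min, _, n_max = rm21
    if rm12 == ONE and rm21 == ONE:
        return 0                      # one-to-one
    if rm21 == ONE and m_min == m_max != 1:
        return 1                      # exactly M B-values per A-value
    if rm12 == ONE and n_min == n_max != 1:
        return 2                      # exactly N A-values per B-value
    if rm21 == ONE and m_min != m_max:
        return 3                      # variable number of B-values
    if rm12 == ONE and n_min != n_max:
        return 4                      # variable number of A-values
    if rm21 == (0, 1, 1) and m_min != m_max:
        return 5                      # as type 3, some B-values unrelated
    if rm12 == (0, 1, 1) and n_min != n_max:
        return 6                      # as type 4, some A-values unrelated
    return 7                          # everything else


print(relation_type((0, 2, 10), (1, 1, 1)))  # 3
```

The order of the tests mirrors the order of the definitions, so a relation that satisfies none of the first seven conditions falls through to type 7.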

Figure  2.4  shows  a directed  graph  representation  of 
the  example  sets  of  data  item  types  and  data  base  relations 
of  Figures  2.1  and  2.3.  In  this  graph  a node  corresponds 
to  a data  item  type  in  SI  and  is  labeled  with  the  name  of 
that  data  item  type.  An  arc  is  directed  from  node  A to 
node B if there exists a relation $R_j \in SR$ such that
A = ORG($R_j$) and B = DST($R_j$), and the arc is labeled with the
name of the relation $R_j$.

This diagram shows the logical associations among
elements of data for a typical data base. The task of data
base design is to group data item types into records and
relate records by data base sets (or inter-record pointers)
such that intra- and inter-record structures are compatible
with DBTG specifications, the resulting structures are (in
some predefined sense) optimal and yet the overall design
can be carried out automatically and with reasonable
efficiency.

    Figure 2.4  Directed Graph Representation of SI and SR

The graphical representation of data items and relations
is helpful in the following few definitions which we
will use in connection with constraint conditions in
Chapter 3. For each data item type $I_i \in SI$,
$i = 1, 2, \ldots, NI$, two counters IORC(i) and IDSC(i)
represent the number of data base relations $R_j \in SR$ such that
$I_i$ = ORG($R_j$) and $I_i$ = DST($R_j$), respectively. In the
graphic representation, for each node $I_i$, IORC(i) (Item
Origin Count) is the number of arcs emanating from $I_i$ and
IDSC(i) (Item Destination Count) is the number of arcs
incident to $I_i$.

A node $I_i \in SI$ is said to be a sink (source) node if
IDSC(i) = 1 and IORC(i) = 0 (IORC(i) = 1 and
IDSC(i) = 0).

Note that our definitions of a sink and a source node are
slightly different from the conventional ones in the sense
that we require IDSC(i) and IORC(i) to be exactly one for
sink and source nodes, respectively. The set SINR of
relations, which is a subset of SR, is defined as follows:

    $SINR = \{R_j \in SR \mid DST(R_j)$ is a sink node; i.e.,
            IDSC(i) = 1 and IORC(i) = 0 where $I_i = DST(R_j)\}$.

Similarly, the set SONR of relations, again a subset of SR
and mutually exclusive with SINR, is defined as follows:

    $SONR = \{R_j \in SR \mid ORG(R_j)$ is a source node; i.e.,
            IORC(i) = 1 and IDSC(i) = 0 where $I_i = ORG(R_j)\}$.

For example, in Figure 2.4, ORGNAM is a sink node and
ORGCOD is not a source node.
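
The counters and the two relation sets can be derived mechanically from the arcs of the graph. In the sketch below the function names and the three-relation example are invented for illustration and are not taken from Figure 2.4.

```python
from collections import Counter


def origin_destination_counts(relations):
    """Return (IORC, IDSC) as counters keyed by data item type name.

    `relations` maps a relation name to its (origin, destination) pair.
    """
    iorc = Counter(org for org, _ in relations.values())
    idsc = Counter(dst for _, dst in relations.values())
    return iorc, idsc


def sink_and_source_relations(relations):
    """Return (SINR, SONR) following the definitions in the text."""
    iorc, idsc = origin_destination_counts(relations)
    sinr = {name for name, (_, dst) in relations.items()
            if idsc[dst] == 1 and iorc[dst] == 0}
    sonr = {name for name, (org, _) in relations.items()
            if iorc[org] == 1 and idsc[org] == 0}
    return sinr, sonr


# Invented graph: D -> A, A -> B, A -> C.
example_relations = {"R1": ("A", "B"), "R2": ("A", "C"), "R3": ("D", "A")}
```

In this example B and C are sink nodes, so SINR = {R1, R2}. D is a source node, so SONR = {R3}. Node A is neither, since two arcs leave it and one arc enters it.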

Given  a set  of  data  item  types  SI  and  a set  of  rela- 
tions SR  whose  elements  are  relations  over  pairs  of  data 
item  types  in  SI,  the  assumption  that  each  data  item  value 
is  related  to  at  least  one  other  data  item  value  through  a 
relation  in  SR  implies  that  each  data  item  type  in  SI 
participates  (as  origin  or  as  destination  data  item  type) 
in  at  least  one  relation  in  SR.  Alternatively  we  can  state 
that  IDSC(i)  and  IORC(i)  cannot  be  simultaneously  equal  to 
zero for any $i \in \{1, 2, \ldots, NI\}$. In graph representation
this  means  that  there  are  no  isolated  single  nodes  in  the 
graph.  The  destination  data  item  type  of  each  relation  R 
in  SINR  participates  in  exactly  one  relation  in  SR,  namely 
R,  and  similarly  the  origin  data  item  type  of  each  relation 
R'  in  SONR  participates  in  exactly  one  relation  in  SR, 
namely  R' . 

2.3  User  Activity 

In the preceding two sections, those components of the
user requirements model dealing with the static view of
data, namely data items and data base relations, were
discussed. In this section we define a set of 12 data
manipulation operations in terms of which sequences of data
retrieval, storage and update can be expressed. These
operations are defined over the data items and data base
relations with no assumption regarding record and set
structures, and in that sense they are different from DBTG
DML commands. Named sequences of data manipulation
operations are called run units. We define the 12 data
manipulation operations in terms of a typical relation
$R_j(A, B)$ over data item types A and B, with A as the origin
and B as the destination data item type. These
operations will be concerned with the retrieval, update or
storage of some value (or collection of values) of one
data item type related, through a specified relation, to a
"given value" of the other data item type, where actual
values will be provided at execution time. For example,
an operation that requires the retrieval of the i-th B-value
related to $a \in A$ in relation $R_j(A, B)$ will specify i and a
at execution time.

With  each  operation  a number  f,  called  the  operation's 
frequency,  is  associated.  The  frequency  of  an  operation  is 
the  number  of  times  that  the  operation  is  expected  to  be 
used  in  some  unit  of  time,  say  a week.  An  operation  is 
defined as a 5-tuple such as O = <op, B, A, R_j, f> where
op ∈ SOP is the operation code (SOP is defined below), R_j
is a relation in SR, B is called the operand of O, A is
called the qualifier of O and f is the frequency of the
operation O. The relation R_j must be defined over A and B
either with A as origin and B as destination data item type
(R_j(A, B)) or vice versa (R_j(B, A)). SOP is the set of
operation codes that we consider in our work and it is
defined as follows:

SOP = {FIND FIRST, FIND LAST, FIND ITH, FIND NEXT,
       FIND PREV., FIND ALL, FIND KEY, STORE FIRST,
       STORE LAST, STORE KEY, MODIFY, ERASE}

Given the above set SOP of 12 operation codes we now
define 12 operations using relation R_j(A, B).
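
The 5-tuple form of an operation can be sketched as a small
Python record. This is only an illustration under stated
assumptions: the class names Relation and Operation and the
validation rules are ours, and the direction chosen for the
OFROFCOU relation is hypothetical.

```python
from dataclasses import dataclass

# The 12 operation codes of SOP.
SOP = frozenset({
    "FIND FIRST", "FIND LAST", "FIND ITH", "FIND NEXT", "FIND PREV.",
    "FIND ALL", "FIND KEY", "STORE FIRST", "STORE LAST", "STORE KEY",
    "MODIFY", "ERASE",
})

@dataclass(frozen=True)
class Relation:
    name: str
    origin: str        # origin data item type
    destination: str   # destination data item type

@dataclass(frozen=True)
class Operation:
    op: str            # operation code, an element of SOP
    operand: str       # data item type B of the 5-tuple
    qualifier: str     # data item type A of the 5-tuple
    relation: Relation
    frequency: int     # expected uses per unit of time

    def __post_init__(self):
        if self.op not in SOP:
            raise ValueError(f"unknown operation code: {self.op}")
        # The relation must be defined over operand and qualifier,
        # in either direction.
        ends = {self.relation.origin, self.relation.destination}
        if {self.operand, self.qualifier} != ends:
            raise ValueError("relation is not defined over operand and qualifier")
        if self.frequency < 0:
            raise ValueError("frequency must be non-negative")

# Direction of OFROFCOU is an assumption made for this sketch.
ofrofcou = Relation("OFROFCOU", origin="COUNUM", destination="OFRCOD")
op1 = Operation("FIND FIRST", "OFRCOD", "COUNUM", ofrofcou, 20)
```

Swapping operand and qualifier in op1 is also accepted, which
mirrors the "or vice versa" clause of the definition.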

1)  <FIND FIRST, B, A, R_j, f>:

Find the first b ∈ B related to a given a ∈ A in
R_j(A, B), with frequency f. This is one of our
retrieval operations requiring the retrieval of the
first element of the set RIM_j(a) where a ∈ A will be
provided at execution time. For example, an operation
such as <FIND FIRST, OFRCOD, COUNUM, OFROFCOU, 20>
requires the retrieval of the offering code of the
first offering of a course whose course number is
given; the two values of OFRCOD and COUNUM are
related through the OFROFCOU relation. The operation is
expected to be executed 20 times over a certain time
period (e.g. a week).

2)  <FIND LAST, B, A, R_j, f>:

Find the last b ∈ B related to a given a ∈ A in
R_j(A, B), with frequency f. This operation is also a
retrieval operation and it requires the retrieval of
the last element of RIM_j(a) for a given a, specified
at execution time.




3)  <FIND ITH, B, A, R_j, f>:

Find the i-th b ∈ B related to a given a ∈ A in
R_j(A, B), with frequency f. In this operation the
i-th B-value related to a specific A-value, say a, is to
be retrieved. The B-value is actually the i-th element
of RIM_j(a). The values of i and a will be provided
at execution time. An example of this kind of
operation would be <FIND ITH, EMPNUM, ORGCOD, EMPOFORG,
400>. This operation requires the retrieval of the
i-th employee number (EMPNUM) related to a given
organization code (ORGCOD) in the relation EMPOFORG,
a total of 400 times.

4)  <FIND NEXT, B, A, R_j, f>:

Find the next b ∈ B related to a given a ∈ A in
R_j(A, B), with frequency f. This operation assumes
that a "current" B-value, say b' ∈ B, has been
established which is related to a ∈ A in R_j(A, B). The
operation then requires the retrieval of the next,
with respect to the current, B-value related to that
given A-value in R_j(A, B), and it is expected to be
used a total of f times.

5)  <FIND PREV., B, A, R_j, f>:

Find the previous b ∈ B related to a given a ∈ A in
R_j(A, B), with frequency f. Similar to the FIND NEXT
operation, this operation assumes that a current
B-value related in R_j to a given A-value, say a, has
been established and now the prior B-value related in
R_j to a is to be retrieved.

6)  <FIND ALL, B, A, R_j, f>:

Find all B-values related to a given a ∈ A in
R_j(A, B), with frequency f. Unlike the previous 5
operations where a single B-value would be retrieved,
in this operation a collection of B-values, namely
the set RIM_j(a) of all b ∈ B related to a ∈ A in
R_j(A, B), should be retrieved. The retrieved values
may or may not be delivered to the run units (put in
the user working area) at the same time.

7)  <FIND KEY, B, A, R_j, f>:

Find a given b ∈ B related to a given a ∈ A in
R_j(A, B), with frequency f. It is clear from its
definition that in the FIND KEY operation both A- and
B-values, namely a and b respectively, are given and
the purpose is to verify whether <a, b> is in R_j(A, B).
This operation is the last retrieval operation and the
following five operations are update operations.

8)  <STORE FIRST, B, A, R_j, f>:

Store a given B-value, b, as the first B-value related
in R_j to a given A-value, a, with frequency f. As a
result of this operation the B-value, to be specified at
execution time, will become the first element of the set
RIM_j(a). This operation establishes the membership of
<a, b> in R_j in the above manner, which may or may not
necessitate the storage of individual a and b occurrences.

9)  <STORE LAST, B, A, R_j, f>:

Store a given B-value, b, as the last B-value related
in R_j to a given A-value, a, with frequency f. Except
for the fact that the STORE LAST operation establishes b
as the last member of RIM_j(a), this operation is similar
to the STORE FIRST operation. In both of these operations,
however, the assumption is that the insertion
of b in RIM_j(a) will not impair its ordering.

10)  <STORE KEY, B, A, R_j, f>:

Store a given B-value, b, in the usual order of
B-values related to a given A-value, a, with frequency
f. This operation establishes the membership of
<a, b> in R_j by inserting b in RIM_j(a) such that its
order is not impaired.

11)  <MODIFY, B, A, R_j, f>:

Modify a given b ∈ B related to a given a ∈ A in R_j to
another b' ∈ B, with frequency f. Similar to previous
operations the actual values of a, b and b' will be
provided at execution time. Note that as the result
of this operation the former B-value, b, may be
deleted from the data base altogether and that the new
B-value, b', is placed in a position in RIM_j(a) where
its ordering is not impaired.

12)  <ERASE, B, A, R_j, f>:

Erase (delete) a given B-value, b, related to a given
a ∈ A in R_j, with frequency f. This operation requires
the removal of the pair <a, b> from R_j which in turn
may or may not necessitate the deletion of either or
both a and b occurrences from the data base.
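
To make the intended effect of the 12 operations concrete,
the following Python sketch models each RIM_j(a) as an
in-memory ordered list. It is a toy model under stated
assumptions, not an implementation proposed by this report:
the class name RelationStore is ours, the "usual order" of
STORE KEY is assumed to be ascending order of the B-values,
and frequencies are ignored because they only matter for
costing.

```python
import bisect

class RelationStore:
    """Toy model: maps each A-value a to the ordered list RIM(a)."""

    def __init__(self):
        self.rim = {}

    # Retrieval operations
    def find_first(self, a):
        return self.rim[a][0]

    def find_last(self, a):
        return self.rim[a][-1]

    def find_ith(self, a, i):
        return self.rim[a][i - 1]          # i is 1-based, as in the text

    def find_next(self, a, current):
        seq = self.rim[a]
        pos = seq.index(current) + 1
        if pos >= len(seq):
            raise LookupError("no next B-value")
        return seq[pos]

    def find_prev(self, a, current):
        seq = self.rim[a]
        pos = seq.index(current) - 1
        if pos < 0:
            raise LookupError("no previous B-value")
        return seq[pos]

    def find_all(self, a):
        return list(self.rim.get(a, []))

    def find_key(self, a, b):
        return b in self.rim.get(a, [])

    # Update operations
    def store_first(self, a, b):
        self.rim.setdefault(a, []).insert(0, b)

    def store_last(self, a, b):
        self.rim.setdefault(a, []).append(b)

    def store_key(self, a, b):
        # Assumed "usual order": ascending order of B-values.
        bisect.insort(self.rim.setdefault(a, []), b)

    def erase(self, a, b):
        self.rim[a].remove(b)

    def modify(self, a, b, b_new):
        # Replace b by b_new, placing b_new where the order is kept.
        self.erase(a, b)
        self.store_key(a, b_new)

store = RelationStore()
for emp in (30, 10, 20):
    store.store_key("ORG1", emp)          # RIM("ORG1") becomes [10, 20, 30]
second = store.find_ith("ORG1", 2)
store.modify("ORG1", 20, 40)              # RIM("ORG1") becomes [10, 30, 40]
```

STORE FIRST and STORE LAST do not re-sort the list, which
matches the stated assumption that such insertions do not
impair the ordering.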


As mentioned earlier in this section, named sequences
of data manipulation operations are called run units and
are denoted by RU_k, k = 1, 2, ..., NU where NU is the
number of run units defined for a specific application.
The set of all run units defined for an application is
denoted by SU: SU = {RU_1, RU_2, ..., RU_NU}. In this set
each RU_k, k = 1, ..., NU is a named sequence of NO_k
operations O^k_ℓ, ℓ = 1, 2, ..., NO_k. Figure 2.5 illustrates
an example set of NU = 4 run units RU_1, ..., RU_4 called
PAYROLL, PERSONNEL, ADMINISTRATION and EDUCATION,
respectively. Run units RU_1, RU_2, RU_3 and RU_4 of Figure 2.5
have the following number of operations: NO_1 = 12, NO_2 = 28,
NO_3 = 15 and NO_4 = 26.


Note that each of the 12 operations defined above on
relation R_j(A, B) can also be specified in the reverse
order of data item types A and B with similar definitions,
i.e., A may be specified as the operand and B as the
qualifier data item type of the operation. For example, an
operation such as <FIND ALL, A, B, R_j, f> means: find all
A-values related to a given b ∈ B in R_j(A, B) with
frequency f. The retrieved set of A-values in this case is
LIM_j(b) instead of RIM_j(a) in the case of operation number
6 above.
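
The two directions can be illustrated with a short Python
sketch that builds both RIM_j and LIM_j from the ordered pairs
of a relation. The function name and the sample pairs are
hypothetical; the sketch only assumes, as the text above does,
that RIM_j(a) collects the B-values related to a and LIM_j(b)
collects the A-values related to b.

```python
from collections import defaultdict

def build_images(pairs):
    """pairs: iterable of <a, b> ordered pairs of one relation R_j(A, B)."""
    rim = defaultdict(list)   # a -> B-values related to a
    lim = defaultdict(list)   # b -> A-values related to b
    for a, b in pairs:
        rim[a].append(b)
        lim[b].append(a)
    return dict(rim), dict(lim)

# Hypothetical pairs of a relation between employees and offerings.
pairs = [("E1", "OF1"), ("E1", "OF2"), ("E2", "OF1")]
rim, lim = build_images(pairs)
```

Here a FIND ALL with E1 as the given qualifier value returns
rim["E1"], while the reversed FIND ALL with OF1 as the given
qualifier value returns lim["OF1"].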


   OPERATION    OPERAND  QUALIFIER  RELATION  FREQUENCY

1  PAYROLL
   FIND FIRST   EMPNAM   EMPNUM     NAMOFEMP       5000
   FIND FIRST   EMPADR   EMPNUM     ADROFEMP       5000
   FIND FIRST   JOBCOD   EMPNUM     JOBOFEMP       5000
   MODIFY       ATHSAL   JOBCOD     SALOFJOB       1000
   FIND NEXT    ATHSAL   JOBCOD     SALOFJOB       2000
   FIND FIRST   ATHSAL   JOBCOD     SALOFJOB       1200
   FIND FIRST   JOBCOD   ATHSAL     SALOFJOB         20
   FIND ALL     ATHSAL   JOBCOD     SALOFJOB       1000
   FIND FIRST   ATHMIN   ATHSAL     MINOFSAL        100
   FIND FIRST   ATHMAX   ATHSAL     MAXOFSAL         50
   MODIFY       DDCTNS   ATHSAL     DDCOFSAL         20
   FIND FIRST   DDCTNS   ATHSAL     DDCOFSAL        100

2  PERSONNEL
   FIND ALL     EMPNUM   ORGCOD     EMPOFORG         40
   FIND NEXT    EMPNUM   ORGCOD     EMPOFORG        800
   STORE KEY    EMPNUM   ORGCOD     EMPOFORG         20
   FIND ITH     EMPNUM   ORGCOD     EMPOFORG        400
   MODIFY       EMPNUM   ORGCOD     EMPOFORG        500
   FIND FIRST   EMPADR   EMPNUM     ADROFEMP       1000
   STORE FIRST  EMPADR   EMPNUM     ADROFEMP         20
   FIND FIRST   EMPNAM   EMPNUM     NAMOFEMP       1000
   STORE FIRST  EMPNAM   EMPNUM     NAMOFEMP         20
   FIND FIRST   EMPLVL   EMPNUM     LVLOFEMP       1000
   STORE FIRST  EMPLVL   EMPNUM     LVLOFEMP         20
   STORE FIRST  BRTDAT   EMPNUM     BRTOFEMP         20
   FIND ALL     JOBCOD   EMPNUM     JHSOFEMP       1000
   FIND NEXT    JOBCOD   EMPNUM     JHSOFEMP       3000
   MODIFY       JHSYRS   EMPNUM     YRSOFEMP          2
   FIND KEY     JOBCOD   EMPNUM     JHSOFEMP        700
   MODIFY       JOBCOD   EMPNUM     JHSOFEMP        500
   FIND NEXT    EMPNUM   JOBCOD     JHSOFEMP        800
   MODIFY       EMPNUM   ORGCOD     EMPOFORG         10
   FIND FIRST   JOBCOD   EMPNUM     JOBOFEMP       1000
   FIND ALL     JOBCOD   EMPNUM     JOBOFEMP       2000
   FIND ALL     EMPNUM   JOBCOD     JOBOFEMP        500
   MODIFY       JOBCOD   EMPNUM     JOBOFEMP       2000
   FIND NEXT    JOBCOD   EMPNUM     JOBOFEMP       3000
   STORE FIRST  EMPMST   EMPNUM     MSTOFEMP         20
   MODIFY       EMPMST   EMPNUM     MSTOFEMP         10
   FIND FIRST   EMPMST   EMPNUM     MSTOFEMP       1000
   FIND ALL     EMPNUM   BRTDAT     BRTOFEMP          2

3  ADMINISTRATION
   FIND FIRST   JOBTTL   JOBCOD     TTLOFJOB         20
   MODIFY       JOBTTL   JOBCOD     TTLOFJOB         10
   FIND ALL     EMPNUM   ORGCOD     EMPOFORG         50
   FIND FIRST   BUDGET   ORGCOD     BGTOFORG         50
   FIND FIRST   BUDGET   ORGCOD     BGTOFORG         50
   FIND FIRST   ORGNAM   ORGCOD     NAMOFORG         50
   MODIFY       JOBCOD   ORGCOD     JOBOFORG         35
   FIND ALL     JOBCOD   ORGCOD     JOBOFORG         40
   FIND ALL     ORGCOD   JOBCOD     JOBOFORG        400
   FIND LAST    JOBCOD   ORGCOD     JOBOFORG         30
   FIND NEXT    JOBCOD   ORGCOD     JOBOFORG         40
   MODIFY       ATHQNT   JOBCOD     QNTOFJOB         90
   MODIFY       BUDGET   ORGCOD     BGTOFORG         20
   MODIFY       ATHMIN   ATHSAL     MINOFSAL         45
   MODIFY       ATHMAX   ATHSAL     MAXOFSAL         50

4  EDUCATION
   FIND ALL     OFRCOD   EMPNUM     TEACHERS        100
   FIND NEXT    OFRCOD   EMPNUM     TEACHERS       1000
   FIND PREV.   OFRCOD   EMPNUM     TEACHERS       1000
   FIND ALL     EMPNUM   OFRCOD     TEACHERS       1000
   FIND NEXT    EMPNUM   OFRCOD     TEACHERS       1000
   FIND PREV.   EMPNUM   OFRCOD     TEACHERS       1000
   STORE LAST   OFRCOD   EMPNUM     TEACHERS         10
   ERASE        OFRCOD   EMPNUM     TEACHERS          5
   MODIFY       EMPNUM   OFRCOD     TEACHERS         20
   STORE KEY    EMPNUM   OFRCOD     STUDENTS         15
   FIND KEY     EMPNUM   OFRCOD     STUDENTS        100
   FIND KEY     EMPNUM   OFRCOD     TEACHERS         40
   FIND FIRST   EMPNUM   OFRCOD     STUDENTS       1000
   FIND PREV.   EMPNUM   OFRCOD     STUDENTS        200
   FIND NEXT    EMPNUM   OFRCOD     STUDENTS        200
   FIND ALL     OFRCOD   EMPNUM     STUDENTS       1000
   FIND FIRST   COUNUM   OFRCOD     OFROFCOU        200
   FIND LAST    OFRCOD   COUNUM     OFROFCOU         20
   MODIFY       OFRCOD   COUNUM     OFROFCOU        800
   FIND PREV.   OFRCOD   COUNUM     OFROFCOU         70
   FIND FIRST   OFRFMT   OFRCOD     FMTOFOFR        200
   FIND FIRST   OFRDAT   OFRCOD     DATOFOFR        200
   FIND FIRST   OFRLOC   OFRCOD     LOCOFOFR        200
   FIND FIRST   OFRCOD   OFRLOC     LOCOFOFR        200
   FIND FIRST   COUTTL   COUNUM     TTLOFCOU        100
   FIND FIRST   COUDSP   COUNUM     DSPOFCOU        100

Figure 2.5  An Example Set SU of 4 Run Units

Having defined data manipulation operations and run
units we will now define an additional variable for each
relation R_j ∈ SR, which is the count of run units that use
R_j in at least one of their operations. The set of
variables RUUR_j, j = 1, 2, ..., NR, is then defined as follows:

RUUR_j = |SUR_j|

where SUR_j = {RU_k ∈ SU | there exists 1 ≤ ℓ ≤ NO_k such
               that R_j is used in O^k_ℓ}.

These  variables  will  be  necessary  in  Chapter  3 where  we 
will  define  storage  costs. 
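
A minimal sketch of the RUUR_j computation follows. The
function name and the dictionary layout are assumptions made
for illustration, and the run units below are only a short
excerpt of Figure 2.5, so the counts hold for this excerpt and
not for the complete figure.

```python
def run_unit_usage(run_units, relations):
    """Return RUUR: for each relation, the number of run units using it.

    run_units: dict mapping a run unit name to its list of operations,
               each operation being (op, operand, qualifier, relation, f).
    relations: iterable of relation names (the set SR).
    """
    usage = {}
    for rel in relations:
        # SUR_j: run units with at least one operation on this relation.
        sur = {name for name, ops in run_units.items()
               if any(o[3] == rel for o in ops)}
        usage[rel] = len(sur)
    return usage

# Excerpt of Figure 2.5 (not the complete run units).
excerpt = {
    "PAYROLL": [
        ("FIND FIRST", "EMPNAM", "EMPNUM", "NAMOFEMP", 5000),
        ("MODIFY", "ATHSAL", "JOBCOD", "SALOFJOB", 1000),
    ],
    "PERSONNEL": [
        ("FIND ALL", "EMPNUM", "ORGCOD", "EMPOFORG", 40),
        ("FIND FIRST", "EMPNAM", "EMPNUM", "NAMOFEMP", 1000),
    ],
    "ADMINISTRATION": [
        ("FIND ALL", "EMPNUM", "ORGCOD", "EMPOFORG", 50),
    ],
}
ruur = run_unit_usage(excerpt, ["NAMOFEMP", "SALOFJOB", "EMPOFORG", "TEACHERS"])
```

A relation that no run unit in the excerpt uses, such as
TEACHERS here, receives a count of zero.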

Furthermore, for each run unit RU_k ∈ SU,
k = 1, ..., NU, we define the following two sets.

PSI_k = {I ∈ SI | for some 1 ≤ ℓ ≤ NO_k, I is either the
         qualifier or the operand data item type of O^k_ℓ}.

PSR_k = {R ∈ SR | for some 1 ≤ ℓ ≤ NO_k, R is the
         relation used in O^k_ℓ}.

We recall that O^k_ℓ is by definition the ℓ-th operation of
the k-th run unit. For each RU_k, k = 1, ..., NU, PSI_k is the
set of data items that run unit RU_k uses and PSR_k is the
set of relations used by RU_k. These two collections of
sets will be used in connection with the constraint
conditions defined in Chapter 3. The sets PSI_k,
k = 1, 2, ..., NU are not necessarily mutually exclusive.
The same holds for PSR_k, k = 1, 2, ..., NU. Furthermore,
the union of PSR_k over k = 1, ..., NU is not necessarily
equal to the set SR. In the case where this union is a
proper subset of SR, the time cost of relations in SR that
are not members of the union will be zero.

In the first run unit RU_1 shown in Figure 2.5, for
example, PSR_1 consists of the seven relations NAMOFEMP,
ADROFEMP, JOBOFEMP, SALOFJOB, MINOFSAL, MAXOFSAL and
DDCOFSAL, and
PSI_1 = {I_4, I_6, I_7, I_8, I_9, I_11, I_12, I_16}.
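
As a check on this example, the two sets can be derived
mechanically from the PAYROLL rows of Figure 2.5. The sketch
below works with names rather than the ordinal numbers used
above, and the function name is hypothetical.

```python
def run_unit_sets(operations):
    """Return (PSI, PSR) for one run unit.

    operations: list of (op, operand, qualifier, relation, f) tuples.
    """
    psi = set()   # data item types used as operand or qualifier
    psr = set()   # relations used by at least one operation
    for _, operand, qualifier, relation, _ in operations:
        psi.update((operand, qualifier))
        psr.add(relation)
    return psi, psr

# The 12 operations of run unit PAYROLL, copied from Figure 2.5.
payroll = [
    ("FIND FIRST", "EMPNAM", "EMPNUM", "NAMOFEMP", 5000),
    ("FIND FIRST", "EMPADR", "EMPNUM", "ADROFEMP", 5000),
    ("FIND FIRST", "JOBCOD", "EMPNUM", "JOBOFEMP", 5000),
    ("MODIFY", "ATHSAL", "JOBCOD", "SALOFJOB", 1000),
    ("FIND NEXT", "ATHSAL", "JOBCOD", "SALOFJOB", 2000),
    ("FIND FIRST", "ATHSAL", "JOBCOD", "SALOFJOB", 1200),
    ("FIND FIRST", "JOBCOD", "ATHSAL", "SALOFJOB", 20),
    ("FIND ALL", "ATHSAL", "JOBCOD", "SALOFJOB", 1000),
    ("FIND FIRST", "ATHMIN", "ATHSAL", "MINOFSAL", 100),
    ("FIND FIRST", "ATHMAX", "ATHSAL", "MAXOFSAL", 50),
    ("MODIFY", "DDCTNS", "ATHSAL", "DDCOFSAL", 20),
    ("FIND FIRST", "DDCTNS", "ATHSAL", "DDCOFSAL", 100),
]
psi_1, psr_1 = run_unit_sets(payroll)
```

The resulting sets contain eight data item types and seven
relations, which agrees with the sizes of PSI_1 and PSR_1.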

The  sequence  of  operations  that  comprise  a run  unit  is 

intended  only  to  model  the  systems  view  of  data  retrieval 
and  update  expected  to  occur  over  the  data  base  by  the  run 
unit.  In  that  sense,  those  operations  do  not  model  the 
exact  DBTG  data  manipulation  commands  nor  do  they  correspond 
to  such  commands  in  a one-to-one  relationship.  Furthermore, 
different  operations  of  a run  unit  (or  different  executions 
of  one  such  operation)  may  not  necessarily  occur  in  the 
order  of  their  appearance  in  the  run  unit.  The  operations 
included  in  run  units  in  SU  will  be  used  to  compute  time 
costs  using  time  cost  functions  of  Chapter  3.  The  order  of 
operations  in  run  units  is  immaterial  in  the  resulting  time 
costs.  Furthermore,  operations  with  zero  frequencies  do 
not  contribute  to  time  costs.  Such  operations  may  be  used 
however  to  expand  PSR^  and  PSI^  sets  if  necessary. 

2.4  Storage  Space  Parameters 

In  this  section  we  define  the  parameters  that  model 
the  storage  requirements  of  an  application.  This  is  the 
last  component  of  the  user  requirements  model.  Values  for 
all  of  the  storage  space  parameters  have  to  be  specified 
by  the  data  base  designer  who  is  using  this  methodology. 

The storage space referred to here consists of space needed
to store data characters of the data base, pointer and
counter  variables,  and  the  space  needed  to  store  certain 
privacy  locks.  It  does  not  include  some  physical  space 
requirements  such  as  access  method  tables  that  are  not 
modelled  here. 

A complete  description  of  the  storage  space  require- 
ments for  an  application  consists  of  specifying  values  for 
the  following  six  parameters. 


1)  TMSB  :  Maximum Storage Bound
2)  PGSZ  :  Page Size
3)  MSPG  :  Main Storage Pages
4)  PTRS  :  Pointer Size
5)  CNTRS :  Counter Size
6)  PLOKS :  Privacy Lock Size

The  storage  space  can  be  viewed  as  a sequence  of 
storage  locations  each  capable  of  holding  one  character 
(one  byte)  of  information.  Associated  with  each  storage 
location  is  a unique  string  of  fixed  length  called  the 
address  of  that  storage  location.  Address  strings  are 
ordered  with  the  same  ordering  defined  in  Section  2.1  for 
strings  of  fixed  length.  The  maximum  size  of  the  storage 
space  is  given  by  TMSB.  This  is  the  absolute  maximum 
storage  limit  that  the  designer  specifies  at  the  beginning 
of  the  optimization  process.  The  optimization  algorithm 
of  Chapter  4 finds  a spectrum  of  data  base  designs  each  of 
which is optimal over a range of storage space limits
below TMSB. TMSB is measured in units of one character
(bytes).

The  storage  space  in  which  the  data  base  will  be 
stored  resides  on  a secondary  storage  medium  and  is  divided 
into  sub-divisions  of  equal  size  given  by  PGSZ,  again 
measured  in  bytes.  Accesses  to  secondary  storage  are  made 
in  units  of  one  page  and  at  most  a given  number  MSPG  of 
data  pages  can  be  in  main  storage  at  a given  time.  PTRS  is 
the  length  of  an  address  string  (or  a data  base  key  in 
DBTG  terminology)  measured  in  units  of  one  byte.  An 
address  occurrence  is  a contiguous  sequence  of  PTRS  storage 
locations  that  contain  either  an  address  string  or  a null 
value.  CNTRS  is  the  size  of  a counter  data  item,  measured 
in units of one byte, used to hold the length of a variably
dimensioned vector or repeating group. Finally, PLOKS is
the  size,  in  bytes,  of  a privacy  lock. 

A typical  storage  space  requirements  specification  is 
the  following. 

TMSB  = 1,600,000 bytes
PGSZ  = 2000 bytes
MSPG  = 2 pages
PTRS  = 4 bytes
CNTRS = 6 bytes
PLOKS = 20 bytes
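
For illustration, the six parameters can be collected in a
small Python record. The class name and the two derived
quantities (the number of pages that fit within TMSB and the
number of bytes held in main storage at one time) are our own
additions for this sketch and are not parameters of the model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageParams:
    tmsb: int    # maximum storage bound, bytes
    pgsz: int    # page size, bytes
    mspg: int    # number of data pages that fit in main storage
    ptrs: int    # pointer (address string) size, bytes
    cntrs: int   # counter size, bytes
    ploks: int   # privacy lock size, bytes

    def total_pages(self) -> int:
        # Whole pages of secondary storage available within TMSB.
        return self.tmsb // self.pgsz

    def main_storage_bytes(self) -> int:
        # Bytes of data pages resident in main storage at one time.
        return self.mspg * self.pgsz

# The typical specification given above.
typical = StorageParams(tmsb=1_600_000, pgsz=2000, mspg=2,
                        ptrs=4, cntrs=6, ploks=20)
```

With these values the storage bound corresponds to 800 pages,
of which at most 2 (4000 bytes) are in main storage at a time.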


In  this  chapter  we  have  presented  the  four  components 
of  the  user  requirements  model,  namely  a set  of  data  item 
types SI, a set of data base relations SR defined over
elements of SI, a set of run units SU consisting of
operations that use elements of SR and a set of six storage
space parameters. In Chapter 3 we define a number of
implementation alternatives for a data base relation. The
information provided in SI, SR, SU and storage space parameters
will  enable  us  to  determine  acceptability,  storage  cost  and 
time  cost  of  relation  implementations.  In  turn,  the  infor- 
mation provided  by  costing  functions  of  Chapter  3 will  be 
used  to  set  up  and  solve  an  optimization  problem,  in 
Chapter  4,  which  will  determine  the  optimal  data  base 
design  for  the  application  described  by  SI,  SR,  SU  and 
storage  space  parameters.  The  user  requirements  model  may, 
then,  be  viewed  as  the  input  model  to  our  data  base  design 
methodology.  Also  in  this  chapter  we  have  given  an 
example  of  a completely  specified  application. 


CHAPTER III

IMPLEMENTATION ALTERNATIVES AND COSTS

The  implementation  of  a data  base  relation  R (A,  B) 
determines  the  way  the  relation  between  data  item  types  A 
and  B will  actually  be  realized  in  the  data  base.  In  that 
sense  the  choice  of  exactly  one  implementation  for  each  and 
every  data  base  relation  in  SR  (set  of  all  relations  for  a 
specific  application)  will  determine  the  overall  data 
structure of the data base, up to the relative ordering of
data  items  in  a record.  There  may  be  more  than  one 
possible  implementation  for  a relation  and  it  is  up  to  the 
optimization  algorithm  to  decide  which  alternative  should 
be  selected  for  each  relation  so  that  the  overall  cost  is 
minimum. 

In  Section  3.1  we  will  define  a set  of  24  implementa- 
tions of  a data  base  relation.  It  is  our  intention  to 
include  a sufficient  number  of  implementation  alternatives 
in  the  model  so  that  as  much  use  as  possible  can  be  made 
of  the  data  structuring  capability  of  the  network  (DBTG) 

data  model.  In  Section  3.2  we  will  present  a set  of  con- 
straint conditions  for  each  relation  implementation  alter- 
native. Basically  these  constraint  conditions  are  a set  of 
rules  which  guarantee  that  data  structures  which  are  illegal 
in  a DBTG  data  model  would  not  result  from  the  optimiza- 
tion process.  These  rules  will  also  guarantee  that  the 
security  requirements  of  run  units  will  always  be  satisfied. 
We will present the formulas for computing storage costs
of relation implementations in Section 3.3. The page
fault  probability  model  used  in  this  work  is  discussed  in 
Section  3.4  and  it  is  used  as  a starting  point  for  the 
derivation  of  time  costs  in  Section  3.4. 

Throughout this chapter, we will use a typical relation
R_j ∈ SR characterized by the following variables in our
definitions. Subscript j will be suppressed where no
ambiguity can arise.

R_j = R(RNAME, A, B, r_j, <m_j, m̄_j, M_j>, <n_j, n̄_j, N_j>)

where

j ∈ {1, ..., NR}    is the ordinal number of R_j ∈ SR and
                    NR = |SR|

RNAME = RNAM(j)     is the name of the relation R_j ∈ SR

A = ORG(R_j)        is the origin data item type of R_j

B = DST(R_j)        is the destination data item type of R_j

r_j                 is the number of ordered pairs in R_j
                    and is called the cardinality of the
                    relation R_j

<m_j, m̄_j, M_j>     is an ordered 3-tuple representing
                    the minimum, average and maximum
                    number of B-values related to one
                    A-value in R_j, respectively

<n_j, n̄_j, N_j>     is an ordered 3-tuple representing
                    the minimum, average and maximum
                    number of A-values related to one
                    B-value in R_j, respectively

The variables associated with data item types A and B that
will be referred to in this chapter are the following.

α'_j = |A|          is the total number of A-values,
                    called the cardinality of data item
                    type A, with β'_j similarly defined for B

α_j                 is the number of A-values related to
                    at least one B-value in R_j, with β_j
                    defined similarly

α''_j               is the size of data item A, with β''_j
                    defined similarly for B.


3.1  Implementation  Alternatives 

The  implementation  alternatives  discussed  in  this 
section  fall  into  two  major  categories.  The  first  category 
contains  implementations  that  group  data  items  together  to 
form  records.  The  second  category  contains  implementations 
that  associate  records  by  means  of  data  base  sets  or 
explicit  pointer  data  items.  The  motivation  for  this  type 
of  categorization  is  the  following.  In  DBTG  data  struc- 
tures, relationships  may  be  implicit  or  explicit.  An 
example  of  an  implicit  relationship  is  the  relationship 
between  the  JOBCOD  and  EMPNUM  data  items  when  these  two 
data  items  are  defined  as  parts  of  the  same  record.  An 
explicit  relationship,  on  the  other  hand,  is  one  that 
associates  related  values  of  the  two  data  item  types 
(residing  in  two  different  records)  using  data  base  sets. 


Figure 3.1 shows a listing of the 24 implementations
included in the model. In the following we will give the
definitions of these implementation alternatives. The
subscript j will be omitted from references to the relation
and to the variables that characterize it throughout this
section.

3.1.1  Fixed  Duplications 

Implementation 1 for R(A, B) is called the fixed
duplication of B under A. It consists of a set of α'
ordered pairs such as

{<a_1, fd_1(B)>, <a_2, fd_2(B)>, ..., <a_α', fd_α'(B)>}.

In the above set a_i, i = 1, ..., α' are unique A-values
and fd_i(B) is a vector of M elements (M is the maximum
number of B-values that may be related to one A-value in
R(A, B)). The vector fd_i(B) contains one copy of each
B-value that is related to a_i in R(A, B), for i = 1, ..., α'.
There are α ordered pairs <a_i, fd_i(B)> for which fd_i(B)
contains at least one non-null B-value and for the rest of
the ordered pairs in the set fd_i(B) consists of M null
B-values. On the average for each pair <a_i, fd_i(B)> in the
set r/α' non-null values (copies of B-values) exist in
fd_i(B). For all i = 1, ..., α' the vector fd_i(B) is
stored in the same record in which a_i is stored (Figure 3.2).
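
The structure of implementation 1 can be sketched in Python
as follows. The sketch is illustrative only: the function
name and the tiny data set are hypothetical, and None stands
in for a null B-value.

```python
def fixed_duplication(a_values, pairs, M):
    """Build the records <a_i, fd_i(B)> of fixed duplication of B under A.

    a_values: all A-values (there are alpha' of them).
    pairs:    the ordered pairs <a, b> of R(A, B).
    M:        maximum number of B-values related to one A-value.
    """
    records = []
    for a in a_values:
        related = [b for (x, b) in pairs if x == a]   # copies of B-values
        if len(related) > M:
            raise ValueError("more than M B-values related to one A-value")
        # Pad with nulls so that every vector has exactly M elements.
        records.append((a, related + [None] * (M - len(related))))
    return records

# Hypothetical data: alpha' = 3 A-values, r = 3 pairs, M = 2.
a_values = ["a1", "a2", "a3"]
pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
records = fixed_duplication(a_values, pairs, M=2)

# alpha: records whose vector holds at least one non-null B-value.
alpha = sum(1 for _, fd in records if any(b is not None for b in fd))
# Average number of non-null values per record; equals r / alpha'.
avg_non_null = sum(b is not None for _, fd in records for b in fd) / len(records)
```

In this data set a3 is related to no B-value, so its vector
consists of M nulls and α = 2 while α' = 3.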

Implementation number 2 for R(A, B) is called the
fixed duplication of A under B. The definition of this
implementation is similar to that of implementation 1, and
it consists of a set of ordered pairs such as

{<b_1, fd_1(A)>, <b_2, fd_2(A)>, ..., <b_β', fd_β'(A)>}.

In the above set b_1, ..., b_β' are unique B-values and
fd_i(A), i = 1, ..., β' is a vector of size N (maximum
number of A-values related to one B-value in R). Each
fd_i(A) vector contains an average of r/β' non-null values,
namely copies of A-values, that are related to b_i in R.
For all i = 1, ..., β' the vector fd_i(A) is stored in the
same record in which b_i is stored (Figure 3.3).

( 1)  Fixed Duplication of B under A
( 2)  Fixed Duplication of A under B
( 3)  Variable Duplication of B under A
( 4)  Variable Duplication of A under B
( 5)  Fixed Aggregation of B under A
( 6)  Fixed Aggregation of A under B
( 7)  Variable Aggregation of B under A
( 8)  Variable Aggregation of A under B
( 9)  Chain with Next Pointers Association of B under A
(10)  Chain with Next Pointers Association of A under B
(11)  Chain with Next and Owner Pointers Association of B
      under A
(12)  Chain with Next and Owner Pointers Association of A
      under B
(13)  Chain with Next and Prior Pointers Association of B
      under A
(14)  Chain with Next and Prior Pointers Association of A
      under B
(15)  Chain with Next, Prior and Owner Pointers Association
      of B under A
(16)  Chain with Next, Prior and Owner Pointers Association
      of A under B
(17)  Pointer Array Association of B under A
(18)  Pointer Array Association of A under B
(19)  Pointer Array with Owner Pointers Association of B
      under A
(20)  Pointer Array with Owner Pointers Association of A
      under B
(21)  Dummy Record Association of A and B
(22)  Single Linkage of B under A
(23)  Single Linkage of A under B
(24)  Double Linkage of A and B

Figure 3.1  A Listing of the 24 Implementations.

3.1.2  Variable  Duplications 

The  next  two  implementations  (numbered  3 and  4)  use 
variable  length  vectors  of  duplicate  values  and  they  are 
defined  below. 

Implementation 3 for R(A, B) is called the variable
duplication of B under A and it consists of the following
set of ordered 3-tuples.

{<a_1, c_1, vd_1(B)>, <a_2, c_2, vd_2(B)>, ...,
 <a_α', c_α', vd_α'(B)>}.

In this set a_1, ..., a_α' are unique A-values, vd_i(B) is a
variable-length vector of B-values and c_i is a counter
data item occurrence of size CNTRS, for i = 1, ..., α'.
For each i = 1, ..., α', the vector vd_i(B) and the
counter c_i are stored in the same record in which a_i is
stored; vd_i(B) contains one copy of each B-value related
to a_i in R and c_i contains the number of these B-values.
The average length of vd_i(B) vectors is r/α' and its range
is from zero to M. These vectors do not contain any null
values (Figure 3.4).

Implementation 4 for R(A, B) (the converse of
implementation 3) is called the variable duplication of A
under B, and it consists of the following set of ordered
3-tuples.

{<b_1, c_1, vd_1(A)>, <b_2, c_2, vd_2(A)>, ...,
 <b_β', c_β', vd_β'(A)>}.

In this set b_1, b_2, ..., b_β' are unique B-values, vd_i(A) is
a variable-length vector of duplicate A-values related to
b_i in R and c_i is a counter data item occurrence (of size
CNTRS) that contains the number of A-values related to b_i
(length of vd_i(A)), i = 1, ..., β'. For all i = 1, ..., β'
the vector vd_i(A) and the counter c_i are stored in the same
record in which b_i is stored (Figure 3.5).
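
The variable duplication of B under A can be sketched in the
same illustrative style. The function name and the data are
hypothetical, and the slot counts at the end are a plain count
of stored B-value positions rather than the storage cost
formulas of Section 3.3.

```python
def variable_duplication(a_values, pairs):
    """Build the records <a_i, c_i, vd_i(B)> of variable duplication of B under A."""
    records = []
    for a in a_values:
        vd = [b for (x, b) in pairs if x == a]   # no null values are stored
        records.append((a, len(vd), vd))         # c_i holds the length of vd_i(B)
    return records

# Hypothetical data: alpha' = 3 A-values, r = 3 pairs, M = 2.
a_values = ["a1", "a2", "a3"]
pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
records = variable_duplication(a_values, pairs)

M = max(c for _, c, _ in records)
fixed_slots = len(a_values) * M                     # alpha' * M positions
variable_slots = sum(c for _, c, _ in records)      # r positions, plus alpha' counters
```

For this data the fixed-length form reserves 6 B-value
positions, while the variable-length form stores 3 B-values
and one counter per record.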

Figure 3.5  Variable Duplication of A under B

3.1.3  Fixed Aggregations

Implementation 5 for R(A, B) is called the fixed
aggregation of B under A and it consists of a set of
ordered pairs such as:

{<a_1, fa_1(B)>, <a_2, fa_2(B)>, ..., <a_α', fa_α'(B)>}.

In this set a_1, a_2, ..., a_α' are unique A-values, fa_i(B)
is a vector of fixed length M (maximum number of B-values
related to one A-value in R) which contains B-values
related to a_i in R and fa_i(B) is stored in the same record
in which a_i is stored. The B-values in vectors of this
implementation are not duplicate copies but rather they
can be used to implement other relations defined over B.
In particular if another data item type D is aggregated
under B (to implement a relation such as R'(B, D) for
example) the vectors of D-values must be stored with each
B-value in fa_i(B) and since data items D and B are not
required to be of the same size, this can generate sequences
of repeating groups. There are, of course, limitations to
the use of this implementation. We will discuss these
limitations when we present the constraint conditions in
Section 3.2. Figure 3.6 shows the fixed aggregation of B
under A and Figure 3.7 shows the combination of
implementation 5 for R(A, B) and R'(B, D).

Implementation  6 for  R(A,  B)  is  called  the  fixed 
aggregation of A under B. This implementation is the
converse of implementation 5 and it consists of a set of
ordered pairs such as:

$\{\langle b_1, fa_1(A)\rangle, \langle b_2, fa_2(A)\rangle, \ldots, \langle b_{\beta'}, fa_{\beta'}(A)\rangle\}$.

Again in this set $b_1, \ldots, b_{\beta'}$ are unique B-values,
$fa_i(A)$ is a vector of A-values related to $b_i$ in R and it
is stored in the same record in which $b_i$ is stored,
$i = 1, \ldots, \beta'$. The fixed size of the $fa_i(A)$ vectors is N,
which is the maximum number of A-values related to one
B-value in R. A-values in the vectors of this implementation
are  shared  occurrences  in  the  sense  that  they  must  be  used 
in  the  implementations  of  all  other  relations  over  A in 
which  A-values  are  not  duplicated. 
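
As an illustration of implementations 5 and 6, the following minimal Python sketch builds the fixed aggregation of B under A from a list of (a, b) pairs. The function name `fixed_aggregation` and the use of `None` to fill unused vector slots are assumptions made for this example; the text above does not say how unused slots are filled.

```python
from collections import defaultdict

def fixed_aggregation(pairs):
    """Return (M, records) for the fixed aggregation of B under A.

    pairs: iterable of (a, b) tuples standing in for R(A, B).
    Each record is (a, vector), where vector has exactly M slots.
    """
    related = defaultdict(list)
    for a, b in pairs:
        related[a].append(b)
    # M is the maximum number of B-values related to one A-value in R.
    M = max((len(bs) for bs in related.values()), default=0)
    # Pad every vector to length M (None is an assumed filler value).
    records = [(a, bs + [None] * (M - len(bs))) for a, bs in related.items()]
    return M, records

R = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]
M, records = fixed_aggregation(R)
```

The converse (implementation 6) can be obtained from the same function by passing the pairs with their components swapped.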


3.1.4  Variable  Aggregations 

The next two implementations (numbered 7 and 8) are
similar to fixed aggregations except that in this case they
use variable length vectors or repeating groups.
Implementation 7 for R(A, B) is called the variable aggregation
of B under A and it consists of a set of ordered 3-tuples
such as:

$\{\langle a_1, c_1, va_1(B)\rangle, \langle a_2, c_2, va_2(B)\rangle, \ldots, \langle a_{\alpha'}, c_{\alpha'}, va_{\alpha'}(B)\rangle\}$.

In the above set $a_1, \ldots, a_{\alpha'}$ are unique A-values, $va_i(B)$
is a variable length vector (or repeating group) containing
B-values related to $a_i$ in R and $c_i$ is a counter data item
occurrence (of size CNTRS) that contains the length of the
vector $va_i(B)$, for $i = 1, \ldots, \alpha'$. The vector $va_i(B)$ and
data item occurrence $c_i$ are stored in the same record in
which $a_i$ is stored. B-values in the vectors of this
implementation  are  shared  occurrences  in  the  sense  that  the 
implementations  of  all  other  relations  over  B which  do  not 
use  duplicate  B-values  should  use  the  same  occurrences  of 
data  item  type  B. 

Implementation 8 for R(A, B), called variable
aggregation of A under B, is the converse of implementation 7
and it consists of the set of ordered 3-tuples:

$\{\langle b_1, c_1, va_1(A)\rangle, \langle b_2, c_2, va_2(A)\rangle, \ldots, \langle b_{\beta'}, c_{\beta'}, va_{\beta'}(A)\rangle\}$.

In this set $b_1, \ldots, b_{\beta'}$ are unique B-values, $va_i(A)$ is a
vector of A-values related to $b_i$ in R, $c_i$ is a counter data
item occurrence that contains the length of $va_i(A)$ and
finally $c_i$ and $va_i(A)$ are stored in the same record in
which $b_i$ is stored, $i = 1, \ldots, \beta'$. Again A-values
stored in the variable-length vectors of this
implementation are shared occurrences.
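
The variable aggregations can be sketched in the same spirit. The following Python fragment is only an illustration: the name `variable_aggregation` is hypothetical, and a Python integer stands in for the counter data item of size CNTRS.

```python
def variable_aggregation(pairs):
    """Return records (x, counter, vector) aggregating the second
    component of each pair under the first one.

    The counter holds the length of the variable-length vector, so no
    padding is needed (unlike the fixed aggregations).
    """
    related = {}
    for x, y in pairs:
        related.setdefault(x, []).append(y)
    return [(x, len(ys), ys) for x, ys in related.items()]

R = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]

# Implementation 7: variable aggregation of B under A.
b_under_a = variable_aggregation(R)

# Implementation 8: the converse, obtained by swapping the components.
a_under_b = variable_aggregation((b, a) for a, b in R)
```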

Implementations  1 through  8 constitute  the  first  cate- 
gory of  implementations,  namely  those  that  require  the  two 
data  item  types  A and  B of  the  relation  R (A,  B)  to  be 
defined  in  one  record  type.  These  implementations  deter- 
mine the  intra-record  structure  of  the  data  base  whereas  the 
second  category  of  implementations  (to  be  defined  in  the 
following  parts  of  this  section)  determine  the  inter-record 
structure  of  the  data  base.  Implementations  1 through  8 
are similar to those defined in [Mitoma 1975].

3.1.5  Chain  with  Next  Pointers  Associations 

Implementation  9 for  R(A,  B)  is  called  the  chain 
with  next  pointers  association  of  B under  A.  Implementa- 
tion 9 consists  of  the  following  two  sets  of  ordered 
pairs.

OWNR = $\{\langle a_1, mp_1\rangle, \langle a_2, mp_2\rangle, \ldots, \langle a_{\alpha'}, mp_{\alpha'}\rangle\}$,

MMBR = $\{\langle b_1, np_1\rangle, \langle b_2, np_2\rangle, \ldots, \langle b_{\beta'}, np_{\beta'}\rangle\}$

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A- and
B-values, respectively,

ii) $mp_i$ is an address occurrence (pointer) that contains:

    1) a null address value if $a_i$ is related to no
       B-value in R,

    2) the address of the record that contains the
       first B-value related to $a_i$ in R, otherwise,

iii) $np_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record that contains the
       A-value related to $b_k$ in R, if $b_k$ is the
       last B-value related to $a_i$ in R,

    3) the address of the next B-value related to
       $a_i$ in R, otherwise,

iv) $mp_i$ and $np_k$ are stored in the same records in
which $a_i$ and $b_k$ are stored, respectively,

$i = 1, \ldots, \alpha'$, $k = 1, \ldots, \beta'$.

This implementation establishes the relation R(A, B) in
the data base through data base sets. Each data base set
of this implementation contains exactly one owner record
in which a unique A-value, $a_i$, is stored and zero or more
member records in each of which a unique B-value, related
to $a_i$ in R, is stored. Data base sets of implementation 9
are stored as chains of records in the sense that in each
set, the owner record contains a pointer $mp_i$ in which the
address of the first member record is stored and each
member record of the set contains a pointer $np_k$ in which the
address of the next member record is stored. The next
pointer, $np_k$, of the last member record contains the
address of the owner record. The set of all set occurrences
of this implementation constitutes a set type whose owner
record type is the set of all owner records. The set of
all member records of these set occurrences is a subset of
the member record type of the set type. The next member
record pointers, $np_k$, in the records that belong to the
member record type but do not belong to any data base set
(of this implementation) contain null address occurrences
(Figure 3.9).

There  are,  of  course,  some  restrictions  to  the  use  of 
implementation  9 for  a relation  which  will  be  discussed 
later.  One  important  restriction,  however,  is  that  a 
record  cannot  belong  to  (participate  as  owner  or  member  in) 
more  than  one  occurrence  of  the  same  set  type.  In  other 
words  DBTG  data  base  sets,  by  definition,  can  only  imple- 
ment one-to-many  relationships.  This  is  why  we  could  talk 
about  the  A-value  related  to  a B-value  in  R,  in  the  defi- 
nition of  implementation  9. 
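
The chain structure of implementation 9 can be illustrated with a small Python sketch. Everything named here (`Owner`, `Member`, `build_chain`, `members_of`, `owner_of`) is hypothetical, and Python object references stand in for record addresses, with `None` playing the role of the null address value.

```python
class Owner:
    def __init__(self, value):
        self.value = value
        self.mp = None   # member pointer: first member, or None

class Member:
    def __init__(self, value):
        self.value = value
        self.np = None   # next pointer: next member, owner if last, or None

def build_chain(owner, members):
    """Link members (in order) into the data base set owned by owner."""
    prev = None
    for m in members:
        if prev is None:
            owner.mp = m
        else:
            prev.np = m
        prev = m
    if prev is not None:
        prev.np = owner  # the last member points back to the owner

def members_of(owner):
    """Yield member values by following mp and then the np pointers."""
    rec = owner.mp
    while rec is not None and rec is not owner:
        yield rec.value
        rec = rec.np

def owner_of(member):
    """Follow next pointers until the owner record is reached."""
    rec = member.np
    while isinstance(rec, Member):
        rec = rec.np
    return rec  # an Owner, or None if the member belongs to no set
```

Because each member record holds a single next pointer, a member can sit in only one chain of the set type, which is the one-to-many restriction noted above. Finding the owner of a member requires walking the rest of the chain.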

Implementation  10  for  R(A,  B)  is  the  converse  of 
implementation  9,  it  is  called  the  chain  with  next  pointers 
association  of  A under  B and  consists  of  the  two  sets  of 
ordered  pairs: 


OWNR = $\{\langle b_1, mp_1\rangle, \langle b_2, mp_2\rangle, \ldots, \langle b_{\beta'}, mp_{\beta'}\rangle\}$,

MMBR = $\{\langle a_1, np_1\rangle, \langle a_2, np_2\rangle, \ldots, \langle a_{\alpha'}, np_{\alpha'}\rangle\}$

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $mp_i$ is an address occurrence (pointer) that contains:

    1) a null address value if $b_i$ is related to no
       A-value in R,

    2) the address of the record that contains the
       first A-value related to $b_i$ in R, otherwise,

iii) $np_k$ is an address occurrence that contains:

    1) a null address value, if $a_k$ is related to no
       B-value in R,

    2) the address of the record that contains the
       B-value related to $a_k$ in R, if $a_k$ is the
       last A-value related to $b_i$ in R,

    3) the address of the next A-value related to
       $b_i$ in R, otherwise,

iv) $mp_i$ and $np_k$ are stored in the same records in
which $b_i$ and $a_k$ are stored, respectively,

$i = 1, \ldots, \beta'$, $k = 1, \ldots, \alpha'$.

This  implementation  uses  the  same  concepts  as  implementa- 
tion 9 except  in  this  case  the  owner  records  of  the  data 
base  sets  that  establish  the  relation  R(A,  B)  in  the  data 
base  contain  unique  B-values  and  the  member  records  contain 
A-values.  We  also  note  that  this  implementation  for  R(A,  B) 
can  only  be  used  if  at  most  one  B-value  is  related  to  every 
A-value  in  R. 
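
The applicability conditions for implementations 9 and 10 can be checked mechanically. The sketch below assumes the relation is available as a list of (a, b) pairs; both function names are hypothetical.

```python
def each_b_has_at_most_one_a(pairs):
    """True if no B-value is related to two different A-values.

    This is the condition needed for the associations "of B under A"
    (for example implementation 9).
    """
    owner = {}
    for a, b in pairs:
        # setdefault records the first A-value seen for b and returns it.
        if owner.setdefault(b, a) != a:
            return False
    return True

def each_a_has_at_most_one_b(pairs):
    """Converse condition, needed for the associations "of A under B"
    (for example implementation 10)."""
    return each_b_has_at_most_one_a((b, a) for a, b in pairs)

R = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]
```

For the sample relation `R`, implementation 9 is admissible while implementation 10 is not, because `a1` is related to two B-values.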


3.1.6  Chain  with  Next  and  Owner  Pointers  Associations 

Implementation  11  for  R(A,  B)  is  called  the  chain 
with  next  and  owner  pointers  association  of  B under  A and 
it  consists  of  the  two  following  sets. 


OWNR = $\{\langle a_1, mp_1\rangle, \langle a_2, mp_2\rangle, \ldots, \langle a_{\alpha'}, mp_{\alpha'}\rangle\}$,

MMBR = $\{\langle b_1, np_1, op_1\rangle, \langle b_2, np_2, op_2\rangle, \ldots, \langle b_{\beta'}, np_{\beta'}, op_{\beta'}\rangle\}$

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $mp_i$ is an address occurrence (pointer) that contains:

    1) a null address value, if $a_i$ is related to no
       B-value in R,

    2) the address of the record in which the first
       B-value related to $a_i$ in R is stored, otherwise,

iii) $np_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record in which the
       A-value, $a_i$, related to $b_k$ in R is stored,
       if $b_k$ is the last B-value related to $a_i$ in R,

    3) the address of the record in which the next
       B-value related to $a_i$ in R is stored, otherwise,

iv) $op_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record in which the
       A-value related to $b_k$ in R is stored,
       otherwise,

v) $mp_i$ and $\langle np_k, op_k\rangle$ are stored in the same
records in which $a_i$ and $b_k$ are stored, respectively,

$i = 1, \ldots, \alpha'$ and $k = 1, \ldots, \beta'$.

This implementation is similar to implementation 9,
it establishes the relation R(A, B) in the data base,
through data base sets, where the owner record of each
set contains a unique A-value, $a_i$, and (zero or more)
member records of the set contain unique B-values related to
$a_i$ in R. Data base sets of this implementation are stored
as chains of records, as discussed in the definition of
implementation 9. However, in this case the (member) record
in which $b_k$ is stored, $k = 1, \ldots, \beta'$, contains, in addition
to the next pointer $np_k$, another pointer, $op_k$, that holds
a null address value if $b_k$ is not related to any A-value in
R, otherwise it holds the address of the record in which
the A-value is stored. Again this implies that only one
A-value (at most) can be related to each B-value and
therefore if R(A, B) is to be implemented by implementation 11
it must satisfy this requirement (Figure 3.10).
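
To make the effect of the owner pointer concrete, the following Python sketch compares the number of pointer dereferences needed to reach the owner from a member with and without $op_k$. The class `Record` and the functions `link` and `hops_via_next` are hypothetical names; object references stand in for addresses.

```python
class Record:
    def __init__(self, value):
        self.value = value
        self.mp = None  # owner records: first member
        self.np = None  # member records: next member, or owner if last
        self.op = None  # member records: owner pointer (implementation 11)

def link(owner, members):
    """Chain the members under owner and set every owner pointer."""
    chain = list(members)
    owner.mp = chain[0] if chain else None
    for cur, nxt in zip(chain, chain[1:] + [owner]):
        cur.np = nxt
        cur.op = owner

def hops_via_next(member, owner):
    """Count dereferences needed to reach owner using np only.

    Assumes member really belongs to owner's set.
    """
    hops, rec = 0, member
    while rec is not owner:
        rec = rec.np
        hops += 1
    return hops

a1 = Record("a1")
bs = [Record(f"b{i}") for i in range(1, 5)]
link(a1, bs)
```

With four members, reaching the owner from the first member through next pointers takes four dereferences, whereas `bs[0].op` reaches it in one; the price is one extra pointer in every member record.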

Implementation  12  for  R(A,  B)  is  the  converse  of 
implementation  11,  it  is  called  the  chain  with  next  and 
owner  pointers  association  of  A under  B and  it  consists  of 
the  following  two  sets. 

OWNR = $\{\langle b_1, mp_1\rangle, \langle b_2, mp_2\rangle, \ldots, \langle b_{\beta'}, mp_{\beta'}\rangle\}$,

MMBR = $\{\langle a_1, np_1, op_1\rangle, \langle a_2, np_2, op_2\rangle, \ldots, \langle a_{\alpha'}, np_{\alpha'}, op_{\alpha'}\rangle\}$

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $mp_i$ is an address occurrence (pointer) that contains:

    1) a null address value if $b_i$ is related to no
       A-value in R,

    2) the address of the record in which the first
       A-value related to $b_i$ in R is stored, otherwise,

iii) $np_k$ is an address occurrence that contains:

    1) a null address value, if $a_k$ is related to no
       B-value in R,

    2) the address of the record in which the
       B-value ($b_i$) related to $a_k$ in R is stored,
       if $a_k$ is the last A-value related to $b_i$ in R,

    3) the address of the next A-value related to
       $b_i$ in R, otherwise,

iv) $op_k$ is an address occurrence that contains:

    1) a null address value if $a_k$ is related to no
       B-value in R,

    2) the address of the record in which the
       B-value related to $a_k$ in R is stored, otherwise,

v) $mp_i$ and $\langle np_k, op_k\rangle$ are stored in the same records
in which $b_i$ and $a_k$ are stored, respectively,

$i = 1, \ldots, \beta'$ and $k = 1, \ldots, \alpha'$.

This  implementation  is  the  converse  of  implementation 
11  in  the  sense  that  in  this  case  unique  B-values  are 
stored  in  the  owner  records  of  the  data  base  sets  of 
implementation  12  and  unique  A-values  are  stored  in  the 
member  records.  We  also  note  that  as  in  the  case  of  imple- 
mentation 10,  implementation  12  for  R(A,  B)  can  only  be 
used  if  at  most  one  B-value  is  related  to  every  A-value  in 
R. 

3.1.7  Chain  with  Next  and  Prior  Pointers  Associations 

Implementation  13  for  R(A,  B)  is  called  the  chain 
with next and prior pointers association of B under A and
it consists of the two following sets of 3-tuples,

OWNR = $\{\langle a_1, mp_1, \ell p_1\rangle, \langle a_2, mp_2, \ell p_2\rangle, \ldots, \langle a_{\alpha'}, mp_{\alpha'}, \ell p_{\alpha'}\rangle\}$,

MMBR = $\{\langle b_1, np_1, pp_1\rangle, \langle b_2, np_2, pp_2\rangle, \ldots, \langle b_{\beta'}, np_{\beta'}, pp_{\beta'}\rangle\}$

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A- and
B-values, respectively,

ii) $mp_i$ is an address occurrence that contains:

    1) a null address value, if $a_i$ is related to no
       B-value in R,

    2) the address of the record in which the first
       B-value related to $a_i$ in R is stored, otherwise,

iii) $\ell p_i$ is an address occurrence that contains:

    1) a null address value, if $a_i$ is related to no
       B-value in R,

    2) the address of the record in which the last
       B-value related to $a_i$ in R is stored, otherwise,

iv) $np_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record in which the
       A-value, $a_i$, related to $b_k$ in R is stored,
       if $b_k$ is the last B-value related to $a_i$ in R,

    3) the address of the record in which the next
       B-value related to $a_i$ in R is stored, otherwise,

v) $pp_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record in which the
       A-value, $a_i$, related to $b_k$ in R is stored,
       if $b_k$ is the first B-value related to $a_i$ in R,

    3) the address of the record in which the prior
       B-value related to $a_i$ in R is stored, otherwise,

vi) $\langle mp_i, \ell p_i\rangle$ and $\langle np_k, pp_k\rangle$ are stored in the same
records in which $a_i$ and $b_k$ are stored, respectively,

$i = 1, \ldots, \alpha'$ and $k = 1, \ldots, \beta'$.

This implementation is similar to implementation 9, it
establishes the relation R(A, B), in the data base, through
data base sets where the owner record of each set contains
a unique A-value, $a_i$, and (zero or more) member records of
the set contain unique B-values related to $a_i$ in R. Data
base sets of this implementation are stored as chains of
records, as discussed in definition of implementation 9.
However, in this case the owner record in which $a_i$,
$i = 1, \ldots, \alpha'$, is stored contains, in addition to the
member pointer $mp_i$, another pointer $\ell p_i$ that holds a null
address value if $a_i$ is not related to any B-value,
otherwise it holds the address of the record in which the last
B-value related to $a_i$ in R is stored. Also the member
record in which $b_k$, $k = 1, \ldots, \beta'$, is stored contains, in
addition to the next pointer $np_k$, another pointer $pp_k$ that
holds a null address value if $b_k$ is not related to any
A-value in R. If $b_k$ is related to some A-value, $a_i$, in R,
then $pp_k$ contains the address of the record in which $a_i$ is
stored in case $b_k$ is the first B-value related to $a_i$ in R,
otherwise $pp_k$ contains the address of the record in which
the prior B-value related to $a_i$ in R is stored (Figure 3.11).

Again  the  relation  R(A,  B)  should  be  such  that  at  most 
one  A-value  is  related  to  each  B-value  if  implementation  13 
is  to  be  used  for  the  relation. 
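
One practical consequence of prior pointers is that a member can be removed from its chain, or the chain scanned backwards, without starting from the first member. The Python sketch below illustrates this under the pointer conventions of implementation 13; the names `Rec`, `link`, `unlink` and `backward` are hypothetical, and object references stand in for addresses.

```python
class Rec:
    def __init__(self, value):
        self.value = value
        self.mp = self.lp = None  # owner: first and last member
        self.np = self.pp = None  # member: next and prior record

def link(owner, members):
    """Build the doubly linked chain; the ends point at the owner."""
    chain = list(members)
    if not chain:
        return
    owner.mp, owner.lp = chain[0], chain[-1]
    ring = [owner] + chain + [owner]
    for prev, cur, nxt in zip(ring, ring[1:], ring[2:]):
        cur.pp, cur.np = prev, nxt

def unlink(owner, m):
    """Remove member m from owner's set using only m's own pointers."""
    prev, nxt = m.pp, m.np
    if prev is owner:
        owner.mp = nxt if nxt is not owner else None
    else:
        prev.np = nxt
    if nxt is owner:
        owner.lp = prev if prev is not owner else None
    else:
        nxt.pp = prev
    m.np = m.pp = None

def backward(owner):
    """List member values from last to first via lp and pp."""
    out, rec = [], owner.lp
    while rec is not None and rec is not owner:
        out.append(rec.value)
        rec = rec.pp
    return out
```

With next pointers alone (implementation 9), the same removal would require walking the chain to find the predecessor of the removed member.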

Implementation  14  is  the  converse  of  implementation  13, 


Figure 3.11 Chain with Next and Prior Pointers Association
of B under A


it  is  called  the  chain  with  next  and  prior  pointers  asso- 
ciation of  A under  B and  it  consists  of  the  following  two 
sets  of  3-tuples. 

OWNR = $\{\langle b_1, mp_1, \ell p_1\rangle, \langle b_2, mp_2, \ell p_2\rangle, \ldots, \langle b_{\beta'}, mp_{\beta'}, \ell p_{\beta'}\rangle\}$,

MMBR = $\{\langle a_1, np_1, pp_1\rangle, \langle a_2, np_2, pp_2\rangle, \ldots, \langle a_{\alpha'}, np_{\alpha'}, pp_{\alpha'}\rangle\}$,

where:

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $mp_i$ is an address occurrence that contains:

    1) a null address value, if $b_i$ is not related
       to any A-value in R,

    2) the address of the record in which the first
       A-value related to $b_i$ in R is stored, otherwise,

iii) $\ell p_i$ is an address occurrence that contains:

    1) a null address value, if $b_i$ is related to no
       A-value in R,

    2) the address of the record in which the last
       A-value related to $b_i$ in R is stored, otherwise,

iv) $np_k$ is an address occurrence that contains:

    1) a null address value, if $a_k$ is related to no
       B-value in R,

    2) the address of the record in which the
       B-value, $b_i$, related to $a_k$ in R is stored,
       if $a_k$ is the last A-value related to $b_i$ in R,

    3) the address of the record in which the next
       A-value related to $b_i$ in R is stored, otherwise,

v) $pp_k$ is an address occurrence that contains:

    1) a null address value if $a_k$ is not related
       to any B-value in R,

    2) the address of the record in which the
       B-value, $b_i$, related to $a_k$ in R is stored,
       if $a_k$ is the first A-value related to $b_i$ in R,

    3) the address of the record in which the prior
       A-value related to $b_i$ in R is stored, otherwise,

vi) $\langle mp_i, \ell p_i\rangle$ and $\langle np_k, pp_k\rangle$ are stored in the same
records in which $b_i$ and $a_k$ are stored, respectively,

$i = 1, \ldots, \beta'$ and $k = 1, \ldots, \alpha'$.

This implementation is the converse of implementation
13 in the sense that in this case unique B-values are
stored in the owner records of the data base sets of
implementation 14 and unique A-values are stored in the member
records.

3.1.8 Chain with Next, Prior and Owner Pointers Associations

Implementation 15 for R(A, B) is called the chain
with next, prior and owner pointers association of B under
A and it consists of the two following sets.

OWNR = $\{\langle a_1, mp_1, \ell p_1\rangle, \langle a_2, mp_2, \ell p_2\rangle, \ldots, \langle a_{\alpha'}, mp_{\alpha'}, \ell p_{\alpha'}\rangle\}$,

MMBR = $\{\langle b_1, np_1, pp_1, op_1\rangle, \langle b_2, np_2, pp_2, op_2\rangle, \ldots, \langle b_{\beta'}, np_{\beta'}, pp_{\beta'}, op_{\beta'}\rangle\}$,

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $mp_i$ is an address occurrence that contains:

    1) a null address value, if $a_i$ is not related
       to any B-value in R,

    2) the address of the record in which the first
       B-value related to $a_i$ in R is stored, otherwise,

iii) $\ell p_i$ is an address occurrence that contains:

    1) a null address value, if $a_i$ is related to no
       B-value in R,

    2) the address of the record in which the last
       B-value related to $a_i$ in R is stored, otherwise,

iv) $np_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record in which the
       A-value, $a_i$, related to $b_k$ in R is stored,
       if $b_k$ is the last B-value related to $a_i$ in R,

    3) the address of the record in which the next
       B-value related to $a_i$ in R is stored, otherwise,

v) $pp_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to
       no A-value in R,

    2) the address of the record in which the
       A-value, $a_i$, related to $b_k$ in R is stored if
       $b_k$ is the first B-value related to $a_i$ in R,

    3) the address of the record in which the prior
       B-value related to $a_i$ in R is stored, otherwise,

vi) $op_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record in which the
       A-value, $a_i$, related to $b_k$ in R is stored,
       otherwise,

vii) $\langle mp_i, \ell p_i\rangle$ and $\langle np_k, pp_k, op_k\rangle$ are stored in the
same records in which $a_i$ and $b_k$ are stored,
respectively,

$i = 1, 2, \ldots, \alpha'$ and $k = 1, 2, \ldots, \beta'$.

This implementation is, in some sense, a combination
of implementation 11 and 13. As in the case of
implementation 11 and 13, implementation 15 for R(A, B) establishes
the relation in the data base by data base sets. The owner
record of each such set contains a unique A-value, $a_i$, and
(zero or more) member record(s) of that set contain unique
B-values related to $a_i$ in R. Data base sets of this
implementation are stored as chains of records as discussed in
the definition of implementation 9. In data base sets of
implementation 15, however, member records contain both
owner and prior pointers in addition to the next pointers.
This implementation is illustrated in Figure 3.12, and it
can be used for relation R(A, B) only if at most one
A-value is related to each B-value in R.

Implementation  16  is  the  converse  of  implementation  15, 
it  is  called  the  chain  with  next,  prior  and  owner  pointers 
association  of  A under  B and  it  consists  of  the  following 
two  sets. 

OWNR = $\{\langle b_1, mp_1, \ell p_1\rangle, \langle b_2, mp_2, \ell p_2\rangle, \ldots, \langle b_{\beta'}, mp_{\beta'}, \ell p_{\beta'}\rangle\}$,

MMBR = $\{\langle a_1, np_1, pp_1, op_1\rangle, \langle a_2, np_2, pp_2, op_2\rangle, \ldots, \langle a_{\alpha'}, np_{\alpha'}, pp_{\alpha'}, op_{\alpha'}\rangle\}$,

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $mp_i$ is an address occurrence that contains:

    1) a null address value if $b_i$ is related to no
       A-value in R,

    2) the address of the record in which the first
       A-value related to $b_i$ in R is stored, otherwise,

iii) $\ell p_i$ is an address occurrence that contains:

    1) a null address value, if $b_i$ is related to no
       A-value in R,

    2) the address of the record in which the last
       A-value related to $b_i$ in R is stored, otherwise,

iv) $np_k$ is an address occurrence that contains:

    1) a null address value, if $a_k$ is related to no
       B-value in R,

    2) the address of the record in which the
       B-value, $b_i$, related to $a_k$ in R is stored,
       if $a_k$ is the last A-value related to $b_i$ in R,

    3) the address of the record in which the next
       A-value related to $b_i$ in R is stored, otherwise,

v) $pp_k$ is an address occurrence that contains:

    1) a null address value, if $a_k$ is related to no
       B-value in R,

    2) the address of the record in which the
       B-value, $b_i$, related to $a_k$ in R is stored if
       $a_k$ is the first A-value related to $b_i$ in R,

    3) the address of the record in which the prior
       A-value related to $b_i$ in R is stored, otherwise,

vi) $op_k$ is an address occurrence that contains:

    1) a null address value if $a_k$ is related to no
       B-value in R,

    2) the address of the record in which the
       B-value, $b_i$, related to $a_k$ in R is stored,
       otherwise,

vii) $\langle mp_i, \ell p_i\rangle$ and $\langle np_k, pp_k, op_k\rangle$ are stored in the
same records in which $b_i$ and $a_k$ are stored,
respectively,

$i = 1, 2, \ldots, \beta'$ and $k = 1, 2, \ldots, \alpha'$.

This  implementation  is  the  converse  of  implementation 
15  in  the  sense  that  in  this  case  unique  B-values  are 
stored  in  the  owner  records  of  the  data  base  sets  of  imple- 
mentation 16  and  unique  A-values  are  stored  in  the  member 
records.  Implementation  16  for  R(A,  B)  can  only  be  used 
if  at  most  one  B-value  is  related  to  each  A-value  in  R. 


3.1.9  Pointer  Array  Associations 

Implementation  17  for  R(A,  B)  is  called  the  pointer 
array  association  of  B under  A and  it  consists  of  the  two 
following  sets. 

OWNR = $\{\langle a_1, vp_1, pa_1\rangle, \langle a_2, vp_2, pa_2\rangle, \ldots, \langle a_{\alpha'}, vp_{\alpha'}, pa_{\alpha'}\rangle\}$,

MMBR = $\{b_1, b_2, \ldots, b_{\beta'}\}$,

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $pa_i$ is a contiguous sequence (array) of M
address occurrences where M is the maximum number
of B-values related to one A-value in R. On the
average the first $r/\alpha'$ elements of the array
contain addresses of records in which B-values
related to $a_i$ in R are stored. The rest of the
elements of the array contain null address
values.

iii) $vp_i$ is an address occurrence that contains the
address of the first element of $pa_i$.

iv) $vp_i$ is stored in the same record in which $a_i$ is
stored,

$i = 1, \ldots, \alpha'$.

This implementation establishes the relation R(A, B)
in the data base through data base sets. Each data base
set of implementation 17 contains exactly one owner record
in which a unique A-value, $a_i$, is stored and zero or more
member record(s) in which unique B-values related to $a_i$
in R are stored. Data base sets, in this implementation,
are stored using the concept of pointer arrays. The data
base set in whose owner record $a_i$ is stored contains a
vector pointer $vp_i$, that points to an array of pointers $pa_i$.
The array $pa_i$ is a sequence of M address occurrences stored
in contiguous storage locations. If the number of B-values
related to $a_i$ in R is $t_i$ where $0 \le t_i \le M$ then the first $t_i$
elements of the array $pa_i$ contain addresses of the records
in which the related B-values are stored and the remaining
$M - t_i$ elements contain null address values. We assume that
the ordering of B-values is preserved by the ordering of
the pointers in the array. This is to say that for
$k' = 1, 2, \ldots, t_i$, the k'-th element of $pa_i$ contains the
address of the record in which the k'-th B-value related
to $a_i$ in R is stored.

As  in  the  case  of  implementation  9 the  set  of  all  such 
data  base  sets  constitute  a data  base  set  type  whose  owner 
record  type  is  the  set  of  all  owner  records  of  these  data 
base  sets.  The  set  of  all  member  records  of  these  data 
base  sets  is  a subset  of  the  member  record  type  of  the  set 
type.  Implementation  17  is  illustrated  in  Figure  3.13. 
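
A pointer array association can be sketched in Python as follows. The names `build_pointer_arrays` and `kth_member` are hypothetical; dictionaries stand in for records, a Python list stands in for the contiguous array $pa_i$, and `None` stands in for the null address value.

```python
def build_pointer_arrays(pairs):
    """Return (M, owners, members) for the pointer array association
    of B under A, given R(A, B) as (a, b) pairs."""
    members, related = {}, {}
    for a, b in pairs:
        member = members.setdefault(b, {"value": b})
        related.setdefault(a, []).append(member)
    # M is the maximum number of B-values related to one A-value.
    M = max((len(recs) for recs in related.values()), default=0)
    owners = {}
    for a, recs in related.items():
        # First t_i slots point at member records, the rest are null.
        pa = recs + [None] * (M - len(recs))
        owners[a] = {"value": a, "vp": pa}  # vp refers to the array
    return M, owners, members

def kth_member(owner, k):
    """Return the k-th (1-based) related B-value with one array access."""
    rec = owner["vp"][k - 1]
    return None if rec is None else rec["value"]

R = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]
M, owners, members = build_pointer_arrays(R)
```

Unlike the chain implementations, the k'-th member is reached directly through the array rather than by following k' next pointers, at the cost of M pointer slots per owner regardless of how many members it has.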

Implementation  18  for  R(A,  B)  is  called  the  pointer 
array  association  of  A under  B.  This  implementation  is  the 
converse  of  implementation  17  and  consists  of  the  following 
two  sets. 

OWNR = $\{\langle b_1, vp_1, pa_1\rangle, \langle b_2, vp_2, pa_2\rangle, \ldots, \langle b_{\beta'}, vp_{\beta'}, pa_{\beta'}\rangle\}$,

MMBR = $\{a_1, a_2, \ldots, a_{\alpha'}\}$,

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $pa_i$ is a contiguous sequence (array) of N
address occurrences where N is the maximum number
of A-values related to one B-value in R. On the
average, the first $r/\beta'$ elements of the array
contain addresses of records in which A-values
related to $b_i$ in R are stored. The rest of the
elements of $pa_i$ contain null address values.

iii) $vp_i$ is an address occurrence that contains the
address of the first element of $pa_i$.

iv) $vp_i$ is stored in the same record in which $b_i$ is
stored,

$i = 1, 2, \ldots, \beta'$.

Figure 3.12 Chain with Next, Prior and Owner Pointers
Association of B under A

This implementation is different from implementation
17 in that in this case owner records of the data base sets
that establish the relation R(A, B) in the data base
contain unique B-values and the member records contain unique
A-values. The pointer array $pa_i$ has N elements where the
k'-th element contains the address of the record in which
the k'-th A-value related to $b_i$ in R is stored, for
$k' = 1, \ldots, t'_i$ where $t'_i$ is the number of A-values related
to $b_i$ in R. The remaining $N - t'_i$ elements of $pa_i$ contain
null address occurrences.

3.1.10  Pointer  Array  with  Owner  Pointer  Associations 

Implementation  19  for  R(A,  B)  is  called  the  pointer 
array  with  owner  pointers  association  of  B under  A and  it 
consists  of  the  following  sets. 

OWNR = $\{\langle a_1, vp_1, pa_1\rangle, \langle a_2, vp_2, pa_2\rangle, \ldots, \langle a_{\alpha'}, vp_{\alpha'}, pa_{\alpha'}\rangle\}$,

MMBR = $\{\langle b_1, op_1\rangle, \langle b_2, op_2\rangle, \ldots, \langle b_{\beta'}, op_{\beta'}\rangle\}$,

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A- and
B-values, respectively,

ii) $pa_i$ is a contiguous sequence (array) of M
address occurrences where M is the maximum
number of B-values related to one A-value in R.
On the average the first $r/\alpha'$ elements of $pa_i$
contain addresses of records in which B-values
related to $a_i$ in R are stored. The rest of the
elements of $pa_i$ contain null address values,

iii) $vp_i$ is an address occurrence that contains the
address of the first element of $pa_i$.

iv) $op_k$ is an address occurrence that contains:

    1) a null address value, if $b_k$ is related to no
       A-value in R,

    2) the address of the record in which the
       A-value, $a_i$, related to $b_k$ in R is stored,
       otherwise,

v) $vp_i$ and $op_k$ are stored in the same records in
which $a_i$ and $b_k$ are stored, respectively,

$i = 1, 2, \ldots, \alpha'$ and $k = 1, 2, \ldots, \beta'$.

This implementation is similar to implementation 17,
it establishes the relation R(A, B) in the data base,
through data base sets, where the owner record of each
set contains a unique A-value, $a_i$, and (zero or more)
member records of the set contain unique B-values related to
$a_i$ in R. Data base sets of implementation 19 are stored
using the pointer array technique as discussed in the
definition of implementation 17. However, in this case, the
(member) record in which $b_k$ is stored, $k = 1, \ldots, \beta'$,
contains a pointer, $op_k$, that holds a null address value
if $b_k$ is not related to any A-value in R, otherwise it
holds the address of the record in which the A-value
related to $b_k$ in R is stored. This implementation can be
used only for relations in which at most one A-value is
related to each B-value. Figure 3.14 is an illustration
of this implementation.
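
The following Python sketch combines a pointer array in each owner record with an owner pointer in each member record, and rejects relations that violate the restriction just stated. The function name `build_with_owner_pointers` and the dictionary layout are assumptions made for illustration only.

```python
def build_with_owner_pointers(pairs):
    """Return (M, owners, members) for the pointer array with owner
    pointers association of B under A.

    Raises ValueError if some B-value is related to two A-values.
    """
    owners, members = {}, {}
    for a, b in pairs:
        owner = owners.setdefault(a, {"value": a, "vp": []})
        member = members.setdefault(b, {"value": b, "op": None})
        if member["op"] is not None and member["op"] is not owner:
            raise ValueError(f"{b!r} is related to more than one A-value")
        if member["op"] is None:
            member["op"] = owner          # owner pointer op_k
            owner["vp"].append(member)    # next free slot of pa_i
    M = max((len(o["vp"]) for o in owners.values()), default=0)
    for o in owners.values():
        # Pad each array to M slots with null address values.
        o["vp"] += [None] * (M - len(o["vp"]))
    return M, owners, members

R = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]
M, owners, members = build_with_owner_pointers(R)
```

Here navigation is direct in both directions: from an owner to its k'-th member through the array, and from a member to its owner through `op`.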

Implementation  20  for  R(A,  B)  is  the  converse  of 
implementation  19,  it  is  called  the  pointer  array  with 
owner  pointers  association  of  A under  B and  it  consists 
of  the  following  two  sets. 

OWNR = $\{\langle b_1, vp_1, pa_1\rangle, \langle b_2, vp_2, pa_2\rangle, \ldots, \langle b_{\beta'}, vp_{\beta'}, pa_{\beta'}\rangle\}$,

MMBR = $\{\langle a_1, op_1\rangle, \langle a_2, op_2\rangle, \ldots, \langle a_{\alpha'}, op_{\alpha'}\rangle\}$,

where

i) $a_1, \ldots, a_{\alpha'}$ and $b_1, \ldots, b_{\beta'}$ are unique A-
and B-values, respectively,

ii) $pa_i$ is a contiguous sequence (array) of N
address occurrences where N is the maximum
number of A-values related to one B-value in R.
On the average the first $r/\beta'$ elements of $pa_i$
contain addresses of records in which A-values
related to $b_i$ in R are stored. The rest of the
elements of $pa_i$ contain null address values.

iii) $vp_i$ is an address occurrence that contains the
address of the first element of $pa_i$.

iv) $op_k$ is an address occurrence that contains:

    1) a null address value if $a_k$ is related to no
       B-value in R,


MICHIGAN  UNIV  ANN  ARBOR  SYSTEMS  ENtlNEERlNO  LAB  F/B  8/2 

A METHODOLOGY  FOR  DATA  BASE  DESIGN  IN  A PAOINO  ENVIRONMENT, (U) 

SEP  77  C BCRELIAN*  K B IRANI  F30602-76-C-0029 


82 


2)  the  address  of  the  record  in  which  the 

B-value,  b.,  related  to  a.  in  R is  stored, 

1 K 

otherwise, 

v)  vp^  and  op^  are  stored  in  the  same  records  in 
which  and  a^  are  stored,  respectively, 
i - 1,  2,  ...  , 8'  and  k=l,  2,  ...  , a*  . 

This implementation is different from implementation
19 in that in this case the owner records of the data base
sets that establish the relation R(A, B) in the data base
contain unique B-values and the member records contain
unique A-values. The pointer array pa_i has N elements
with the first t'_i elements containing addresses of member
records and the remaining N - t'_i elements containing null
address values, where t'_i is the number of A-values related
to b_i in R. For k' = 1, 2, ..., t'_i the k'-th
element of pa_i contains the address of the record in which
the k'-th A-value related to b_i in R is stored.
Implementation 20 for relation R(A, B) can only be used if
at most one B-value is related to each A-value in R.

Implementations 9 through 20 all use the concept of
data base sets to implement a relation such as R(A, B), and
only one data base set type is necessary to establish
R(A, B) in the data base. What makes these implementations
different from one another is the set implementation
technique that is used in each case and whether data item A
is defined as part of the owner record type with B defined
as part of the member record type or vice versa. These
implementations all have the limitation that they only
implement one-to-many relationships.

3.1.11  Dummy  Record  Association  of  A and  B 

Implementation  21  for  R(A,  B)  is  called  the  dummy 
record  association  of  A and  B and  it  consists  of  the  fol- 
lowing three  sets. 

OWNR  = {<a_1, mp_1>, <a_2, mp_2>, ..., <a_α', mp_α'>},

OWNR' = {<b_1, mp'_1>, <b_2, mp'_2>, ..., <b_β', mp'_β'>},

LINK  = {<op'_1, op_1, np'_1, np_1>, <op'_2, op_2, np'_2, np_2>, ...,
         <op'_r, op_r, np'_r, np_r>},

where 

i) a_1, ..., a_α' and b_1, ..., b_β' are unique A- and
B-values, respectively,

ii) <op'_k, op_k, np'_k, np_k>, k = 1, ..., r, is a
contiguous sequence of 4 address occurrences where r is
the cardinality of R(A, B),

iii) for each pair <a, b> ∈ R there exists exactly
one 4-tuple <op'_k, op_k, np'_k, np_k> ∈ LINK (and
vice versa) in which:

1) op'_k contains the address of the record in
which a is stored;

2) op_k contains the address of the record in
which b is stored;

3) np'_k contains the address of the first element
of the 4-tuple in LINK associated with the
pair <a, b'> ∈ R such that b' is the next
(with respect to b) B-value related to a in
R. If b is the last B-value related to a in
R then np'_k contains the address of the record
in which a is stored,

4) np_k contains the address of the first element
of the 4-tuple in LINK associated with
the pair <a', b> ∈ R such that a' is the
next (with respect to a) A-value related to
b in R. If a is the last A-value related to
b in R then np_k contains the address of the
record in which b is stored,

iv) mp_i is an address occurrence that contains:

1) a null address value if a_i is related to no
B-value in R,

2) the address of the first element of the
4-tuple in LINK associated with <a_i, b> in
R such that b is the first B-value related
to a_i in R,

v) mp'_j is an address occurrence that contains:

1) a null address value if b_j is related to no
A-value in R,

2) the address of the first element of the
4-tuple in LINK associated with <a, b_j> in R
such that a is the first A-value related to
b_j in R,

vi) mp_i and mp'_j are stored in the same records in
which a_i and b_j are stored, respectively,

vii) 4-tuples <op'_k, op_k, np'_k, np_k>, k = 1, ..., r,
are called dummy records.

Implementation 21 for R(A, B) uses the concept of data base
sets to establish the relation R(A, B) in the data base.
However, in contrast to other associations, implementation
21 uses two data base set types. Each data base set of the
first type consists of exactly one owner record in which a
unique A-value, a_i, is stored and zero or more member
records that are dummy records associated with pairs
<a_i, b_l> for all l such that <a_i, b_l> ∈ R. Similarly, each
data base set of the second type consists of exactly one
owner record in which a unique B-value, b_i, is stored and
zero or more member records that are dummy records
associated with pairs <a_l, b_i> for all l such that
<a_l, b_i> ∈ R.
The set of all owner records of data base sets of the first
(second) type constitutes the owner record type of the first
(second) data base set type. The set of all member records
(dummy records in this case) constitutes the member record
type of both set types.

In contrast to all other association implementations,
implementation 21 may be used to implement many-to-many
relationships.

The reason is that the dummy records that constitute the
member records of this implementation, although of the same
record type, belong to data base sets of different set
types. Each data base set of implementation 21, of either
type, is stored with a set implementation technique
identical to that of the chain with next and owner pointer
associations discussed in 3.1.6. Other techniques (or
combinations of other techniques) may be used for storing data
base sets of this implementation, thereby producing many
more new relation implementation alternatives. We are not
including those other alternatives because they do not bear
any new ideas and they can easily be added to the model if
desired. The technique adopted above has a reasonable
storage/time tradeoff. It was superior to some other
possible techniques which we tried.

Implementation  21  is  illustrated  in  Figure  3.15. 
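
A small Python sketch may help to show why the dummy records allow many-to-many relationships. It is an illustration only, not the report's implementation: addresses are modelled as tagged tuples, None is the null address value, and the field names owner_a, owner_b, next_for_a and next_for_b are hypothetical stand-ins for the two owner pointers and the two next pointers of each 4-tuple in LINK.

```python
def build_dummy_record_association(a_values, b_values, pairs):
    """Sketch of OWNR, OWNR' and LINK for a relation R(A, B)."""
    owner_a = {a: {"value": a, "mp": None} for a in a_values}
    owner_b = {b: {"value": b, "mp": None} for b in b_values}
    link = []                       # one dummy record per pair <a, b> in R
    last_for_a, last_for_b = {}, {}
    for a, b in pairs:
        k = len(link)
        link.append({
            "owner_a": ("A", a),    # address of the record storing a
            "owner_b": ("B", b),    # address of the record storing b
            # The last dummy record of a chain points back to its owner.
            "next_for_a": ("A", a),
            "next_for_b": ("B", b),
        })
        if a in last_for_a:
            link[last_for_a[a]]["next_for_a"] = ("LINK", k)
        else:
            owner_a[a]["mp"] = ("LINK", k)   # first dummy record for a
        last_for_a[a] = k
        if b in last_for_b:
            link[last_for_b[b]]["next_for_b"] = ("LINK", k)
        else:
            owner_b[b]["mp"] = ("LINK", k)   # first dummy record for b
        last_for_b[b] = k
    return owner_a, owner_b, link


def follow_chain(start, link, next_field, value_field):
    """Walk one chain of dummy records until it returns to an owner record."""
    values = []
    ptr = start
    while ptr is not None and ptr[0] == "LINK":
        record = link[ptr[1]]
        values.append(record[value_field][1])
        ptr = record[next_field]
    return values
```

Each dummy record is a member of exactly one set of each type, so a B-value can be reached from several A-values (and vice versa) without storing any pair twice.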

3.1.12  Linkages 

Implementation 22 for R(A, B) is called the single
linkage of B under A and it consists of the following two
sets.

{<a_1, kp_1>, <a_2, kp_2>, ..., <a_α', kp_α'>},

{b_1, b_2, ..., b_β'},

where

i) a_1, ..., a_α' and b_1, ..., b_β' are unique A-
and B-values, respectively,

ii) kp_i is a vector of M address occurrences where
M is the maximum number of B-values related to
an A-value in R,

iii) if t_i (0 ≤ t_i ≤ M) is the number of B-values
related to a_i in R, then for k = 1, ..., t_i,
the k-th element of kp_i contains the address of
the record in which the k-th B-value related to
a_i in R is stored. The remaining M - t_i elements
of kp_i contain null address values. Note that
if a_i is related to no B-value in R (t_i = 0)
then all M elements of kp_i contain null address
values,

iv) kp_i is stored in the same record in which a_i is
stored,

i = 1, ..., α'.

Implementation 22 establishes the relation R(A, B) in
the data base by storing vectors of pointer (data base key)
data items in records. The record in which a_i
(i = 1, ..., α') is stored contains a fixed-length vector
kp_i of pointers (data base keys). On the average r/α'
elements of kp_i contain addresses of records in which
B-values related to a_i in R are stored. The rest of the
elements of kp_i contain null address values.

In contrast to the association implementations, implementation
22 does not use data base sets although relationships are
realized through pointers. Pointers used in implementation
22 may be explicitly retrieved and/or updated, like any
other data item, by programs accessing the data base,
whereas pointers used to store data base sets are created
and maintained by the data base management system.

The fact that no data base sets are used in
implementation 22 makes it possible to implement a relation
R(A, B) even if it is not a one-to-many relation.
Implementation 22 is illustrated in Figure 3.16.
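
The single linkage of B under A amounts to a fixed-length pointer vector in each A-record. The following Python sketch illustrates it under simple assumptions: a record address is taken to be the index of the record in a list, None is the null address value, and the function name is hypothetical.

```python
def build_single_linkage(a_values, b_values, pairs):
    """Sketch of implementation 22: a pointer vector kp_i in each A-record."""
    b_addr = {b: k for k, b in enumerate(b_values)}   # address = list index
    related = {a: [] for a in a_values}
    for a, b in pairs:
        related[a].append(b_addr[b])

    # M: maximum number of B-values related to an A-value in R.
    m_max = max((len(v) for v in related.values()), default=0)

    a_records = [
        {"value": a, "kp": related[a] + [None] * (m_max - len(related[a]))}
        for a in a_values
    ]
    b_records = [{"value": b} for b in b_values]
    return a_records, b_records
```

Because the B-records carry no owner pointer, the same B-record address may appear in the vectors of several A-records, which is why this layout is not restricted to one-to-many relations.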

Implementation 23 for R(A, B) is called the single
linkage of A under B and it consists of the following two
sets.

{<b_1, kp_1>, <b_2, kp_2>, ..., <b_β', kp_β'>},

{a_1, a_2, ..., a_α'},

where

i) a_1, ..., a_α' and b_1, ..., b_β' are unique A-
and B-values, respectively,

ii) kp_i is a vector of N address occurrences where
N is the maximum number of A-values related to
a B-value in R,

iii) if b_i is related to no A-value in R, then all N
elements of kp_i contain null address values,
otherwise for k = 1, ..., t'_i (where t'_i is the
number of A-values related to b_i in R) the k-th
element of kp_i contains the address of the
record in which the k-th A-value related to b_i
in R is stored,

iv) kp_i is stored in the same record in which b_i is
stored,

i = 1, ..., β'.

Implementation 23 for R(A, B) is the converse of
implementation 22 in the sense that in this case pointer
vectors kp_i are stored in records that contain B-values,
and they point to records in which related A-values are
stored.


Implementations  22  and  23  are  very  similar  in  the  way 
they  actually  establish  a data  base  relation  to  implemen- 
tations 17  and  19,  respectively,  defined  in  3.1.9.  The 
important  difference  is  in  the  fact  that  in  pointer  array 
associations  (implementations  17  and  19)  data  base  set  types 
will  be  defined  which  will  establish  the  relation.  This 
has  the  implication  that  pointers  used  in  the  implementation 
will  be  maintained  by  the  data  base  management  system  in  the 
case  of  associations,  rather  than  the  user  in  the  case  of 
linkages.  Clearly  there  is  a difference  between  the  two 
cases  as  far  as  user  convenience  is  concerned  which  we  will 
not  be  able  to  capture  in  our  evaluations  of  implementation 
alternatives.  However  a simple  modification  to  our  imple- 
mentation of  the  design  methodology  may  be  made  to  allow 
the  designer  to  prevent  the  selection  of  certain  implemen- 
tations for  certain  relations. 

The last implementation alternative that we include
in our model is implementation 24, called the double linkage
of A and B. It consists of the following two sets of
ordered pairs.

{<a_1, kp_1>, <a_2, kp_2>, ..., <a_α', kp_α'>},

{<b_1, kp'_1>, <b_2, kp'_2>, ..., <b_β', kp'_β'>},

where

i) a_1, ..., a_α' and b_1, ..., b_β' are unique A-
and B-values, respectively,

ii) kp_i is a vector of M address occurrences where
M is the maximum number of B-values related to
an A-value in R,

iii) if a_i is related to no B-value in R, then all M
elements of kp_i contain null address values,
otherwise for k' = 1, ..., t_i (where t_i is the
number of B-values related to a_i in R) the k'-th
element of kp_i contains the address of the
record in which the k'-th B-value related to a_i
in R is stored,

iv) kp'_k is a vector of N address occurrences where N
is the maximum number of A-values related to a
B-value in R,

v) if b_k is related to no A-value in R, then all N
elements of kp'_k contain null address values,
otherwise for k' = 1, ..., t'_k (where t'_k is the
number of A-values related to b_k in R) the k'-th
element of kp'_k contains the address of the record
in which the k'-th A-value related to b_k in R
is stored,

vi) kp_i and kp'_k are stored in the same records in
which a_i and b_k are stored, respectively,
i = 1, ..., α' and k = 1, ..., β'.

Implementation 24 is a combination of implementations
22 and 23. It establishes the relation R(A, B) in the
data base by storing vectors of pointers (data base keys)
in records. The record in which a_i (i = 1, ..., α') is
stored contains a fixed-length vector kp_i of M pointers.
On the average r/α' elements of kp_i contain addresses of
records in which B-values related to a_i in R are stored.
The rest of the elements of kp_i contain null address values.
Also the record in which b_k (k = 1, ..., β') is stored
contains a fixed-length vector kp'_k of N pointers. On the
average r/β' elements of kp'_k contain addresses of records
in which A-values related to b_k in R are stored and the
rest of the elements contain null address values (Figure
3.17).
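
The average fill of the two pointer vectors can be checked on a toy relation. The sketch below is an illustration under assumptions: record addresses are list indices, None is the null address value, and build_double_linkage and average_fill are hypothetical helper names.

```python
def build_double_linkage(a_values, b_values, pairs):
    """Sketch of implementation 24: kp_i in A-records, kp'_k in B-records."""
    a_addr = {a: i for i, a in enumerate(a_values)}
    b_addr = {b: k for k, b in enumerate(b_values)}
    to_b = {a: [] for a in a_values}
    to_a = {b: [] for b in b_values}
    for a, b in pairs:
        to_b[a].append(b_addr[b])
        to_a[b].append(a_addr[a])

    m_max = max((len(v) for v in to_b.values()), default=0)   # M
    n_max = max((len(v) for v in to_a.values()), default=0)   # N

    def pad(vector, size):
        # Fill the unused tail of a fixed-length vector with null addresses.
        return vector + [None] * (size - len(vector))

    a_records = [{"value": a, "kp": pad(to_b[a], m_max)} for a in a_values]
    b_records = [{"value": b, "kp_prime": pad(to_a[b], n_max)} for b in b_values]
    return a_records, b_records


def average_fill(records, field):
    """Average number of non-null pointers per record."""
    filled = sum(1 for rec in records for p in rec[field] if p is not None)
    return filled / len(records)
```

With r = 3 pairs, α' = 3 and β' = 2, the average fills come out as r/α' = 1.0 for kp and r/β' = 1.5 for kp', while every record still reserves M or N slots respectively.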

This concludes the definitions of the set of 24
implementation alternatives in our model. In Section 3.2 we
present a set of constraint conditions for each
implementation and then in Sections 3.3 and 3.4 we present
storage and time costs for each implementation.

It is the combination of exactly one implementation
for every relation that determines the structure of the
data base. The constraint conditions narrow down the
number of alternatives for implementing a relation to those
that produce legal DBTG structures, and then the
optimization algorithm selects a (reasonably) optimal
solution, namely one with minimum time cost subject to a
storage space limit.

3.2  Constraint  Conditions 

In  this  section  we  develop  a set  of  conditions  for 
every  implementation  alternative.  These  conditions  must  be 
satisfied  by  the  variables  that  define  a data  base  relation 
in order that it can be implemented by the implementation
to which the conditions pertain. The purpose of imposing
these  constraint  conditions  on  relation  implementations  is 
to  guarantee  that  the  resulting  data  structure  definitions 
stay  within  the  limitations  of  DBTG  specifications.  In 
addition  to  that,  certain  security  requirements  of  run  units 
affect  the  choice  of  an  implementation  for  a relation  and 
this  will  also  be  reflected  in  the  constraint  conditions. 

For example, the non-redundancy convention of the DBTG proposal
requires that a data relationship be stored only once in the
data base. This implies that exactly one implementation
must be selected for each relation, and that in a relation
implementation each element (ordered pair) of the relation
must be represented (stored) only once. Another DBTG convention
requires that variably dimensioned vectors or repeating
groups be allowed only at level 1 of a record definition.

In  order  to  increase  security  we  will  add 
constraint  conditions  that  will  prevent  the  implementation 
of  a relation  by  duplication  or  aggregation  when  there  is 
a run  unit  which  must  be  denied  access  to  the  relation  while 
it  should  be  able  to  access  individual  data  item  types  of 
the  relation. 

The constraint conditions will also reduce the size of
the solution space by rejecting clearly costly and/or
inconsistent choices of implementations for a given relation.

We now proceed with presenting, for each implementation
number MPL, the constraint conditions CC(MPL, c) that a
relation must satisfy so that it can be implemented by
implementation number MPL. (c is an integer representing
the condition number.)

We first define a set of variables AC(MPL, j), one for
each <implementation, relation> pair. The value of this
variable is 1 if implementation MPL is acceptable for
relation R_j and zero otherwise. These variables will be
expressed in terms of logical expressions involving
constraint conditions such as CC(MPL, c).


3.2.1  Implementations  1 and  2 

Implementation 1 for R_j(A, B), which is the fixed
duplication of B under A, is always acceptable except if
there is a run unit RU_k such that ORG(R_j) ∈ PSI_k and
DST(R_j) ∈ PSI_k but R_j ∉ PSR_k. (PSI_k and PSR_k were defined
in Chapter II.)

CC(1, 1) : ∀ k ∈ {1, 2, ..., NU} (ORG(R_j) ∈ PSI_k &
           DST(R_j) ∈ PSI_k ⟹ R_j ∈ PSR_k)

AC(1, j) = 1 if CC(1, 1) is true
         = 0 otherwise

Similarly, implementation 2 is acceptable for R_j if the above
condition holds, so AC(2, j) = AC(1, j).
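
The security condition CC(1, 1) can be evaluated mechanically once the run units are known. The Python sketch below is an illustration only. It assumes that PSI_k can be represented as the set of data item types run unit k may access and PSR_k as the set of relations it may access (both are defined in Chapter II); the function names and the pair-of-sets encoding are hypothetical.

```python
def cc_1_1(org, dst, relation, run_units):
    """CC(1, 1): every run unit that may access both ORG(R_j) and DST(R_j)
    must also be permitted to access R_j itself.

    run_units is an iterable of (psi, psr) pairs of sets.
    """
    return all(
        relation in psr
        for psi, psr in run_units
        if org in psi and dst in psi
    )


def ac_1(org, dst, relation, run_units):
    """AC(1, j): 1 if implementation 1 is acceptable for R_j, else 0."""
    return 1 if cc_1_1(org, dst, relation, run_units) else 0
```

A run unit that sees only one of the two data item types does not constrain the choice, since the implication in CC(1, 1) is then vacuously true.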


3.2.2  Implementations  3 and  4 

Implementations 3 and 4 for R_j are variable
duplications and hence they will be represented by variably
dimensioned vectors or repeating groups. The DBTG
limitation that permits variably dimensioned vectors or
repeating groups only in level 1 requires the imposition of
the following constraint conditions in this case.

CC(3, 2) : AC(5, i) = AC(7, i) = 0 for all i
           such that DST(R_i) = ORG(R_j)

CC(3, 3) : AC(6, i) = AC(8, i) = 0 for all i ≠ j
           such that ORG(R_i) = ORG(R_j)

These two conditions guarantee that ORG(R_j) will not be
aggregated under any other data item type.

Implementations 3 and 4 are among the implementation
types which require that the two data item types of the
relation be defined in one record type. Therefore, as in
the previous case, if there is a run unit which should be
given access to ORG(R_j) and DST(R_j) and at the same time
be denied access to R_j then these two implementations
(3 & 4) must be made unacceptable for R_j. In other words,
the following condition must be true for R_j if it is to be
implemented by implementation 3.

CC(3, 1) : ∀ k ∈ {1, 2, ..., NU} (ORG(R_j) ∈ PSI_k &
           DST(R_j) ∈ PSI_k ⟹ R_j ∈ PSR_k)


In summary, implementations 3 and 4 are acceptable for R_j
if AC(3, j) = 1 and AC(4, j) = 1, respectively, where
AC(3, j) and AC(4, j) are defined as follows.

CC(3, 1) = CC(1, 1) : ∀ k ∈ {1, 2, ..., NU}
           (ORG(R_j), DST(R_j) ∈ PSI_k ⟹ R_j ∈ PSR_k)

CC(3, 2) : AC(5, i) = AC(7, i) = 0 for all i such
           that DST(R_i) = ORG(R_j)

CC(3, 3) : AC(6, i) = AC(8, i) = 0 for all i ≠ j
           such that ORG(R_i) = ORG(R_j)

AC(3, j) = 1 if CC(3, c) is true for c = 1, 2, 3
         = 0 otherwise

Similarly

CC(4, 1) = CC(1, 1)

CC(4, 2) : AC(5, i) = AC(7, i) = 0 for all i ≠ j
           such that DST(R_i) = DST(R_j)

CC(4, 3) : AC(6, i) = AC(8, i) = 0 for all i such
           that ORG(R_i) = DST(R_j)

AC(4, j) = 1 if CC(4, c) is true for c = 1, 2, 3
         = 0 otherwise

3.2.3  Constraint  Conditions  for  Implementations  5 and  6 

These two implementations are the fixed aggregations
of B under A and A under B, respectively, for R_j(A, B) and
they were defined in 3.1.3.

Again since the two data item types of the relation,
ORG(R_j) and DST(R_j), will be defined in one record type,
we require that the following condition hold if R_j is to
be implemented by, say, implementation 5.

CC(5, 1) : ∀ k ∈ {1, 2, ..., NU} (ORG(R_j),
           DST(R_j) ∈ PSI_k ⟹ R_j ∈ PSR_k)

In implementation 5 for R_j(A, B), B-values related to a
single A-value in R_j are stored in the record in which the
A-value is stored, and these same stored occurrences of A-
and B-values may be used to implement other relations on A
or B.

The use of the same stored values of one data item type to
implement multiple relations may cause a number of problems
which we are trying to avoid by the imposition of constraint
conditions. These problems and constraint conditions will
be discussed below. But first we note that in the special
case where R_j ∈ SINR, R_j is the only data base relation in
which DST(R_j) participates (either as origin or as
destination data item type) and therefore no problems can
arise. Implementation 5 for R_j is, then, acceptable if
CC(5, 1) and CC(5, 2) both hold for R_j. CC(5, 1) was given
before and CC(5, 2) is as follows.

CC(5, 2) : R_j ∈ SINR

When CC(5, 2) is not true for R_j, i.e., when DST(R_j)
participates in other data base relations, we require R_j to
be of type 1, for the following reasons.

We recall that a fixed aggregation of B under A
(implementation 5) uses fixed-sized vectors to store
B-values under the related A-value. These vectors are of
size M_j and if for some element of A the number of
related B-values (in R_j), say t, is less than M_j, then
M_j - t null B-values must be stored in the vector. The
problem arises from the fact that other implementations use
the same B-values in the sense that other data items may be
either aggregated or duplicated under these values, which
are in some instances only null values. This has the
undesirable effect of producing large amounts of wasted
storage because all vectors repeated under null B-values
contain (again) null values. In order to avoid this
situation we require that m_j be equal to M_j. For the
same reason every A-value must be related to at least one
B-value, that is, m_j should be different from zero, which
together with m_j = M_j implies:

M_j = m̄_j = m_j.


On the other hand, every B-value must be related to at
most one A-value. Again, since the same occurrences of
B-values may be used to implement other relations, if the
same B-values are repeated under more than one A-value then
some pairs of those relations will be stored more than once
in the data base in violation of the non-redundancy
requirement. However, each B-value must be related to at
least one A-value because otherwise it would be impossible to
use the non-related B-values to implement other relations.
Therefore the following must be true:

n_j = n̄_j = N_j = 1.

This requirement together with m_j = m̄_j = M_j characterizes a
type 1 relation and so we have our next constraint
condition regarding implementation 5 for R_j.

CC(5, 3) : R_j is a type 1 relation.
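
The two requirements derived above can be tested directly on the pairs of a relation. The following Python sketch is an illustration under assumptions: the relation is given as a collection of distinct (a, b) pairs together with the non-empty lists of all A- and B-values, and the names fanout_stats and satisfies_type_1_conditions are hypothetical. It checks only the characterization stated in this section; the formal definition of relation types is in Chapter II.

```python
from collections import Counter


def fanout_stats(a_values, b_values, pairs):
    """Return (min, max) B-values per A-value and (min, max) A-values per B-value."""
    per_a = Counter(a for a, _ in pairs)
    per_b = Counter(b for _, b in pairs)
    a_counts = [per_a[a] for a in a_values]   # unrelated values count as 0
    b_counts = [per_b[b] for b in b_values]
    return min(a_counts), max(a_counts), min(b_counts), max(b_counts)


def satisfies_type_1_conditions(a_values, b_values, pairs):
    """Check: every A-value has the same non-zero number of B-values,
    and every B-value is related to exactly one A-value."""
    m_min, m_max, n_min, n_max = fanout_stats(a_values, b_values, pairs)
    return m_min == m_max and m_min >= 1 and n_min == 1 and n_max == 1
```

When the minimum and maximum counts coincide the average necessarily equals them as well, so it does not need to be computed separately.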

The next and final constraint condition for using
implementation 5 to implement R_j is intended to prevent
aggregation of one data item type under more than one data
item type. Implementation 5 for R_j(A, B) aggregates B
under A and it must be made acceptable only if A is the
only data item type under which B will be aggregated.
This constraint is necessary because, by definition, the
occurrences of B-values in implementation 5 are the only
non-duplicate copies and any other occurrences of B-values
in other parts of the data base are duplicate copies
(results of implementations 1-4). If data item type B is
the destination data item type of more than one relation,
all of which are of type 1, then it is potentially possible
to aggregate B under more than one data item type. In this
case, we have to make a decision as to which one of the
candidate relations must be allowed to be implemented
using implementation 5. A reasonable decision is to
select the relation that produces the smallest number of
B-values per record. This results in records of shortest
length over all other decisions.

The final constraint condition for implementation 5 is,
then, given by the following.

CC(5, 4) : (M_j ≤ M_i for all i ≠ j such that
           DST(R_i) = DST(R_j) and R_i is of
           type 1)
           and (M_j ≤ N_i for all i ≠ j such that
           ORG(R_i) = DST(R_j) and R_i is of
           type 2)


It is evident that this condition, while sufficient to
prevent the aggregation of DST(R_j) under more than one data
item type, is stronger than necessary. This is because
CC(5, 4) rejects implementation 5 or 6 for some relations
R_i in favor of implementation 5 for R_j, while it is not
known whether implementation 5 will actually be selected
for R_j by the optimization algorithm. Although the fact
that this condition is stronger than necessary has the
side effect of excluding some potentially acceptable data
base designs from the solution space, we need to adopt
the condition as stated above because it allows the
decision stages of the optimization algorithm to be
independent of one another. The decision stages of the
optimization algorithm must be independent in the sense that
the choice of one implementation for a relation (a decision)
must only affect other decisions (and consequently the
solution) by the amounts of the storage and time costs of the
relation implementation. It should not restrict (or be
restricted by) other choices for other relations.

The complete set of constraint conditions for
implementations 5 and 6 is, then, given as follows.

CC(5, 1) = CC(1, 1)

CC(5, 2) : R_j ∈ SINR

CC(5, 3) : R_j ∈ {R_i ∈ SR | R_i is a type 1 relation}

CC(5, 4) : (M_j ≤ M_i for all i ≠ j such that
           DST(R_i) = DST(R_j) and R_i is of
           type 1)
           and (M_j ≤ N_i for all i ≠ j such that
           ORG(R_i) = DST(R_j) and R_i is of
           type 2)

Implementation 5 for R_j(A, B) is acceptable if AC(5, j) = 1,
where

AC(5, j) = 1 if CC(5, 1) ∧ (CC(5, 2) ∨ (CC(5, 3) ∧
           CC(5, 4)))
         = 0 otherwise.
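
The way the four conditions combine into AC(5, j) can be sketched as follows. This is an illustration under assumptions, not the report's program: the Relation record and its field names are hypothetical, the relation type is taken as a given integer, and the comparisons in CC(5, 4) are assumed to be non-strict.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    name: str
    org: str        # ORG(R): origin data item type
    dst: str        # DST(R): destination data item type
    rel_type: int   # relation type as defined in Chapter II
    M: int          # maximum number of B-values related to one A-value
    N: int          # maximum number of A-values related to one B-value


def cc_5_4(j, relations):
    """CC(5, 4): R_j yields the fewest values of DST(R_j) per record among
    the relations that could also aggregate DST(R_j)."""
    for i in relations:
        if i.name == j.name:
            continue
        if i.dst == j.dst and i.rel_type == 1 and not j.M <= i.M:
            return False
        if i.org == j.dst and i.rel_type == 2 and not j.M <= i.N:
            return False
    return True


def ac_5(cc1, cc2, cc3, cc4):
    """AC(5, j) = 1 if CC(5,1) and (CC(5,2) or (CC(5,3) and CC(5,4)))."""
    return 1 if cc1 and (cc2 or (cc3 and cc4)) else 0
```

Note that the security condition CC(5, 1) is required in every case, while the type and size conditions matter only when R_j is not in SINR.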

Similarly, constraint conditions for implementation 6 are:

CC(6, 1) = CC(1, 1)

CC(6, 2) : R_j ∈ SONR

CC(6, 3) : R_j ∈ {R_i ∈ SR | R_i is a type 2 relation}

CC(6, 4) : (N_j ≤ N_i for all i ≠ j such that
           ORG(R_i) = ORG(R_j) and R_i is of
           type 2)
           and (N_j ≤ M_i for all i ≠ j such that
           DST(R_i) = ORG(R_j) and R_i is of
           type 1).

Implementation 6 for R_j(A, B) is acceptable if AC(6, j) = 1,
where

AC(6, j) = 1 if CC(6, 1) ∧ (CC(6, 2) ∨ (CC(6, 3) ∧
           CC(6, 4)))
         = 0 otherwise.

3.2.4  Constraint  Conditions  for  Implementations  7 and  8 

Implementations 7 and 8 are similar to 5 and 6,
respectively, with the difference that in 7 and 8 we are
dealing with variably dimensioned vectors or repeating
groups. For example, implementation 7 for R_j(A, B) stores
vectors of variable lengths (average m̄_j) of B-values under
the related A-values in the record in which the latter is
stored. In that respect constraint conditions for 7 and 8
are rather similar to those for 5 and 6 respectively,
except that we have to make sure that the variably
dimensioned vectors of the implementation do not occur at
levels higher than 1 in a record definition. In other words,
we have to make implementation 7 unacceptable for R_j if
ORG(R_j) could be aggregated (fixed or variable) under any
other data item type to implement another relation. We now
proceed with the definition of the constraint conditions
regarding implementation 7 for R_j(A, B). Similar
conditions will be presented for implementation 8 but not
discussed in detail.

As with all other duplications and aggregations, the
first constraint condition deals with the security
requirements. This constraint condition prevents the use of
implementation 7 for R_j(A, B) if there exists a run unit
which must be denied access to R_j while it should be able to
access A and B (through other relations). The first
constraint condition, then, for implementation 7 is the
following.

CC(7, 1) : ∀ k ∈ {1, 2, ..., NU} (ORG(R_j),
           DST(R_j) ∈ PSI_k ⟹ R_j ∈ PSR_k)

The next two conditions are intended to guarantee that
the variably dimensioned vectors of this implementation
occur only at level 1 of a record type. There are two ways
that the variably dimensioned vector of A-values
(A = ORG(R_j)) can end up at levels higher than one: 1) if
A is also the destination data item type of another relation
R_i ∈ SR, for which implementation 5 or 7 is acceptable,
and 2) if A is also the origin data item type of another
relation R_i ∈ SR, for which implementation 6 or 8 is
acceptable. The first situation is avoided by requiring
the following condition.

CC(7, 2) : R_i is of type 1 or 3 for no i (i ≠ j)
           such that DST(R_i) = ORG(R_j)

As will be seen later in this section, one of the
constraint conditions for implementations 5 and 7 for a
relation is that the relation must be of types 1 and 3
respectively. The requirement that A should not
participate as destination in any relation of SR which is of
type 1 excludes the possibility of aggregation of A under any
other data item type. This becomes clear by looking at
CC(5, 3) in the last section. Note, however, that
CC(5, 3) = FALSE is sufficient to make AC(5, i) = 0
because CC(5, 2) is true only if R_i ∈ SINR, which is
impossible since A = ORG(R_j). Similarly if R_i is of type 3
for no relation R_i such that DST(R_i) = ORG(R_j) then
implementation 7 is unacceptable for all relations in which A
participates as destination.

Similarly we can show that CC(7, 3), defined below, is sufficient to
ensure that implementations 6 and 8 are unacceptable for all relations
in which A participates as origin. The two conditions CC(7, 2) and
CC(7, 3) will then guarantee that $A = ORG(R_j)$ will not be
aggregated under any other data item type.

CC(7, 3) : $R_i$ is of type 2 or 4 for no $i \neq j$ such that
$ORG(R_i) = ORG(R_j)$.

Again, these two conditions, CC(7, 2) and CC(7, 3), are only
sufficient conditions. They make implementation 7 unacceptable for a
relation $R_j$ if $ORG(R_j)$ could possibly be aggregated under any
data item type, irrespective of whether it actually will be. We have
to adopt these conditions, however, for the same reason as discussed
in Section 3.2.3 in the case of constraint condition CC(5, 4).

The next constraint condition for implementation 7 is similar to
CC(5, 2) discussed in 4.3. Again, if no other data base relation is
defined over $B = DST(R_j)$, i.e., if $R_j \in SINR$, then no
difficulty can arise by aggregating B under A for $R_j$, and hence if
the following is true we make implementation 7 acceptable for $R_j$
regardless of the type of the relation (of course CC(7, 3) and
CC(7, 2) must be true).

$$CC(7, 4) : R_j \in SINR$$

However, if the above constraint condition is not satisfied, the
following conditions should hold so that no inconsistencies result
from aggregating B under A. By the same arguments as described in
3.2.3, which led to CC(5, 3), we can show that the following should
hold for $R_j$.

$$N_j = \bar{n}_j = n_j = 1$$

However, since implementation 7 is a variable aggregation (of B under
A) and therefore does not have data item occurrences that contain null
B-values, we no longer need $M_j = \bar{m}_j = m_j$. This means that
$R_j$ can be either of type 1 or of type 3, and so we have the
following.

$$CC(7, 5) : R_j \in \{R_i \in SR \mid R_i \text{ is of type 1 or type 3}\}$$

The remaining conditions are intended to guarantee that
$B = DST(R_j)$ will be aggregated (fixed or variable) under at most
one other data item type. The reason behind this requirement was
explained in 3.2.3 where we introduced the constraint condition
CC(5, 4). In order to accomplish this we make implementation 7
unacceptable for $R_j(A, B)$ if implementation 5 is acceptable for any
$R_i \in SR$ for which $DST(R_i) = DST(R_j)$, or if implementation 6
is acceptable for any $R_i \in SR$ for which $ORG(R_i) = DST(R_j)$.
This makes implementation 7 unacceptable for $R_j$ if $DST(R_j)$ is
(fixed) aggregated under any data item type in SI. And again, for the
same reasons as described in 3.2.3 (for implementation 5), if
$DST(R_j)$ can be variably aggregated under more than one other data
item type then we choose the one with minimum average repetition
factor $\bar{m}_k$ (or $\bar{n}_k$, whichever applies).

In other words, we have the following two additional constraint
conditions for implementation 7.

CC(7, 6) : $R_i$ is not of type 1 for any $R_i \in SR$ ($i \neq j$)
such that $DST(R_i) = DST(R_j)$, and $R_i$ is not of type 2 for any
$R_i \in SR$ such that $ORG(R_i) = DST(R_j)$

CC(7, 7) : $\bar{m}_j < \bar{m}_i$ for all $i \neq j$ such that
$DST(R_i) = DST(R_j)$ and $R_i$ is of type 3, and
$\bar{m}_j < \bar{n}_i$ for all $i$ such that $ORG(R_i) = DST(R_j)$
and $R_i$ is of type 4.

Constraint conditions CC(7, 5) to CC(7, 7) are also only sufficient
conditions.

The complete set of constraint conditions for implementations 7 and 8
for $R_j$ is, then, as follows.

CC(7, 1) = CC(1, 1)
CC(7, 2) : $R_i$ is of type 1 or 3 for no $i$ such that $DST(R_i) = ORG(R_j)$
CC(7, 3) : $R_i$ is of type 2 or 4 for no $i \neq j$ such that $ORG(R_i) = ORG(R_j)$
CC(7, 4) : $R_j \in SINR$
CC(7, 5) : $R_j \in \{R_i \in SR \mid R_i \text{ is of type 1 or type 3}\}$
CC(7, 6) : as above
CC(7, 7) : as above

Implementation 7 for $R_j(A, B)$ is, then, acceptable if AC(7, j) = 1
where:

$$AC(7, j) = \begin{cases} 1 & \text{if } CC(7,1) \wedge CC(7,2) \wedge CC(7,3) \wedge \big(CC(7,4) \vee (CC(7,5) \wedge CC(7,6) \wedge CC(7,7))\big) \\ 0 & \text{otherwise.} \end{cases}$$
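The acceptability test for implementation 7 is a plain Boolean combination of the seven conditions, and a minimal sketch of it in Python may make the grouping explicit. The function name `ac7` and the dictionary layout are illustrative assumptions, not part of the report; the truth values of the individual conditions are assumed to have been evaluated elsewhere.

```python
def ac7(cc: dict[int, bool]) -> int:
    """Return AC(7, j) for one relation.

    cc maps the condition number n (1..7) to the truth value of CC(7, n)
    for that relation (hypothetical input layout).
    """
    # CC(7,1), CC(7,2) and CC(7,3) must always hold.
    base = cc[1] and cc[2] and cc[3]
    # Either CC(7,4) holds, or CC(7,5), CC(7,6) and CC(7,7) all hold.
    aggregation_ok = cc[4] or (cc[5] and cc[6] and cc[7])
    return 1 if base and aggregation_ok else 0
```

The same shape applies to AC(8, j) with the CC(8, n) values substituted.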

Similarly, constraint conditions for implementation 8 are as follows.

CC(8, 1) = CC(1, 1)
CC(8, 2) : $R_i$ is of type 1 or 3 for no $i \neq j$ such that $DST(R_i) = DST(R_j)$
CC(8, 3) : $R_i$ is of type 2 or 4 for no $i$ such that $ORG(R_i) = DST(R_j)$
CC(8, 4) : $R_j \in SONR$
CC(8, 5) : $R_j \in \{R_i \in SR \mid R_i \text{ is of type 2 or type 4}\}$
CC(8, 6) : $R_i$ is not of type 1 for any $R_i \in SR$ such that
$DST(R_i) = ORG(R_j)$, and $R_i$ is not of type 2 for any
$R_i \in SR$ ($i \neq j$) such that $ORG(R_i) = ORG(R_j)$
CC(8, 7) : $\bar{n}_j < \bar{n}_i$ for all $i \neq j$ such that
$ORG(R_i) = ORG(R_j)$, and $\bar{n}_j < \bar{m}_i$ for all $i$ such
that $DST(R_i) = ORG(R_j)$.

Implementation 8 for $R_j(A, B)$ is, then, acceptable if AC(8, j) = 1
where:

$$AC(8, j) = \begin{cases} 1 & \text{if } CC(8,1) \wedge CC(8,2) \wedge CC(8,3) \wedge \big(CC(8,4) \vee (CC(8,5) \wedge CC(8,6) \wedge CC(8,7))\big) \\ 0 & \text{otherwise.} \end{cases}$$

3.2.5  Constraint  Conditions  for  Implementations  9 through  20 
Implementations 9 through 20 are all associations of different types
for $R_j(A, B)$. The odd numbered implementations are associations of
B under A and the even numbered implementations are associations of A
under B. The constraint conditions for implementations 9, 11, 13, 15,
17 and 19 are identical, and so are the constraint conditions for
implementations 10, 12, 14, 16, 18 and 20. We first develop constraint
conditions for implementation 9 in detail and use them for
implementations 11, 13, 15, 17 and 19, and then we derive constraint
conditions for implementations 10, 12, ..., 20.

Implementation 9 is the chain with next pointers association of B
under A for $R_j(A, B)$. In that sense, as defined in 3.1.5, a data
base set type will be defined where owner records will contain unique
A-values and member records will contain unique B-values. The
uniqueness requirement of A- and B-values in owner and member records,
together with the fact that implementation 9 does not use duplicate
copies of A- and B-values, requires that data item types A and B of
$R_j$ should not be aggregated under any other data item type. The
following constraint condition is sufficient to ensure that neither
A ($= ORG(R_j)$) nor B ($= DST(R_j)$) will be aggregated (fixed or
variable) under any other data item type.

CC(9, 1) : AC(5, i) = AC(7, i) = 0 for all $i \neq j$ such that
$DST(R_i) \in \{ORG(R_j), DST(R_j)\}$, and
AC(6, i) = AC(8, i) = 0 for all $i \neq j$ such that
$ORG(R_i) \in \{ORG(R_j), DST(R_j)\}$


The next constraint condition of implementation 9 for $R_j(A, B)$ is
due to a well known DBTG requirement about data base set types. In a
DBTG set type a single record cannot participate in more than one
occurrence of the same set type. This has the implication that a
record may not be member in (or owner of) more than one occurrence of
the same set type, or even a member in one occurrence and owner in
another. In other words, DBTG set types can only implement
one-to-many relationships. This brings us to the next constraint
condition for implementation 9, as follows.

$$CC(9, 2) : R_j \in \{R_i \in SR \mid R_i \text{ is of type 5, 3 or 1}\}$$

Note  that  in  a type  5 relation,  a B-value  is  related  to  at 
most  one  A-value  but  an  A-value  can  be  related  to  any 
number  of  B-values  (including  zero). 

The constraint conditions for implementations 9 through 20 for
$R_j(A, B)$ are, then, as follows.

For MPL = 9, 11, 13, ..., 19:

CC(MPL, 1) : AC(5, i) = AC(7, i) = 0 for all $i \neq j$ such that
$DST(R_i) \in \{ORG(R_j), DST(R_j)\}$, and
AC(6, i) = AC(8, i) = 0 for all $i \neq j$ such that
$ORG(R_i) \in \{ORG(R_j), DST(R_j)\}$

CC(MPL, 2) : $R_j \in \{R_i \in SR \mid R_i \text{ is of type 5, 3 or 1}\}$

For MPL = 10, 12, 14, ..., 20:

CC(MPL, 1) : AC(5, i) = AC(7, i) = 0 for all $i \neq j$ such that
$DST(R_i) \in \{ORG(R_j), DST(R_j)\}$, and
AC(6, i) = AC(8, i) = 0 for all $i \neq j$ such that
$ORG(R_i) \in \{ORG(R_j), DST(R_j)\}$

CC(MPL, 2) : $R_j \in \{R_i \in SR \mid R_i \text{ is of type 6, 4 or 2}\}$.

Implementation MPL $\in$ {9, 10, 11, ..., 20} is, then, acceptable
for $R_j(A, B)$ if AC(MPL, j) = 1 where:

$$AC(MPL, j) = \begin{cases} 1 & \text{if } CC(MPL, 1) \text{ and } CC(MPL, 2) \text{ are both true} \\ 0 & \text{otherwise.} \end{cases}$$
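To make the two association conditions concrete, the following is a minimal Python sketch of AC(MPL, j) for MPL in 9 through 20. The `Relation` tuple, the zero-based relation indices and the `ac` lookup table keyed by `(mpl, i)` are illustrative assumptions; the report itself prescribes no data layout. The table is assumed to already hold AC(5, i) through AC(8, i) for every relation.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    org: str    # origin data item type, ORG(R)
    dst: str    # destination data item type, DST(R)
    rtype: int  # relation type, 1..6


def cc_mpl_1(j: int, relations: list, ac: dict) -> bool:
    """CC(MPL, 1): neither ORG(R_j) nor DST(R_j) may be aggregated elsewhere."""
    items = {relations[j].org, relations[j].dst}
    for i, rel in enumerate(relations):
        if i == j:
            continue
        # A or B would be aggregated as destination of R_i (implementations 5, 7).
        if rel.dst in items and (ac[(5, i)] or ac[(7, i)]):
            return False
        # A or B would be aggregated as origin of R_i (implementations 6, 8).
        if rel.org in items and (ac[(6, i)] or ac[(8, i)]):
            return False
    return True


def ac_association(mpl: int, j: int, relations: list, ac: dict) -> int:
    """AC(MPL, j) for the association implementations 9..20."""
    if not 9 <= mpl <= 20:
        raise ValueError("this sketch covers implementations 9 through 20 only")
    # CC(MPL, 2): odd MPL needs type 5, 3 or 1; even MPL needs type 6, 4 or 2.
    allowed = {1, 3, 5} if mpl % 2 == 1 else {2, 4, 6}
    cc2 = relations[j].rtype in allowed
    return 1 if cc_mpl_1(j, relations, ac) and cc2 else 0
```

For implementations 21 through 24 only the first condition is needed, so `cc_mpl_1` alone would decide acceptability under the same assumptions.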

3.2.6  Constraint  Conditions  for  Implementations  21,  22,  23 
and  24 

Implementations 21, 22, 23 and 24 have identical constraint
conditions. These implementations were defined in 3.1.11 and 3.1.12.
Implementation 21 for $R_j(A, B)$ is the dummy record association of A
and B, and the other three implementations were defined as linkages.
These implementations are similar to other associations, with the
difference that implementation 21 uses two data base sets to establish
relation $R_j(A, B)$ and implementations 22, 23 and 24 do not use data
base sets at all. This makes it possible to use any one of the above
four implementations to implement a many-to-many relation. Individual
set types of implementation 21 still have to satisfy the DBTG
convention that the relationship between owners and members of data
base sets (of one type) must be a one-to-many relationship. But the
combination of two set types, each of which satisfies the DBTG
convention, can implement a many-to-many relationship in the manner
discussed in 3.1.11.

The effect of the observation made above on the constraint conditions
for implementations 21, ..., 24 for $R_j(A, B)$ is that $R_j$ no
longer has to be of type 5 or 6; therefore a constraint condition such
as CC(MPL, 2) of 3.2.5 is no longer necessary. However, A- and
B-values in the records related by any of the implementations should
still be unique values, and hence the (only) constraint condition for
implementations 21, 22, 23 and 24 is similar to CC(MPL, 1) described
in 3.2.5. It is given below.

CC(MPL, 1) : AC(5, i) = AC(7, i) = 0 for all $i \neq j$ such that
$DST(R_i) \in \{ORG(R_j), DST(R_j)\}$, and
AC(6, i) = AC(8, i) = 0 for all $i \neq j$ such that
$ORG(R_i) \in \{ORG(R_j), DST(R_j)\}$.

Accordingly implementation MPL is acceptable for $R_j$ if
AC(MPL, j) = 1 where, for MPL = 21, 22, 23 and 24:

$$AC(MPL, j) = \begin{cases} 1 & \text{if } CC(MPL, 1) \text{ is true} \\ 0 & \text{otherwise.} \end{cases}$$

This concludes the discussion of constraint conditions for all
implementation alternatives. It is evident that some of these
constraint conditions depend on the acceptability of other relation
implementations. Therefore the acceptability of some implementations
for a relation should be determined before others. In particular, the
acceptability of aggregations (implementations 5, 6, 7 and 8) must be
determined before all other implementations except the fixed
duplications (implementations 1 and 2). This is because the variables
AC(5, j), AC(6, j), AC(7, j) and AC(8, j), which record the
acceptability of implementations 5, 6, 7 and 8, respectively, for a
relation, appear in the definitions of the constraint conditions for
all other implementations (except implementations 1 and 2).

Therefore we first determine AC(MPL, j) for MPL = 1, 2, then we
continue with MPL $\in$ {5, 6, 7, 8}, and then we can determine
AC(MPL, j) for all other implementations, namely for
MPL $\in$ {3, 4, 9, 10, ..., 24}. For each MPL we determine
AC(MPL, j) for all $j \in \{1, \ldots, NR\}$. We note, however, that
the order in which relations are considered does not have any effect
on the results of the AC(MPL, j) evaluations.
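The evaluation order described above can be sketched as a short driver loop. This is only an illustration under stated assumptions: `evaluate` is a hypothetical callback that computes AC(MPL, j) from the values already stored in the table, and relations are numbered 1 to NR.

```python
def evaluation_order() -> list[int]:
    """Order in which the 24 implementations are tested for acceptability."""
    fixed_duplications = [1, 2]
    aggregations = [5, 6, 7, 8]
    remaining = [3, 4] + list(range(9, 25))
    return fixed_duplications + aggregations + remaining


def evaluate_all(nr: int, evaluate) -> dict:
    """Fill the AC table; evaluate(mpl, j, ac) may read earlier entries of ac."""
    ac = {}
    for mpl in evaluation_order():
        # The order of relations inside one MPL pass does not matter.
        for j in range(1, nr + 1):
            ac[(mpl, j)] = evaluate(mpl, j, ac)
    return ac
```

Because every MPL pass runs over all relations before the next pass starts, AC(5, i) through AC(8, i) are available for every i by the time the remaining implementations are tested.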


3.3  Storage  Costs 

In this section we will define the storage cost of each
implementation alternative defined in section 3.1. The storage costs
are measured in the number of characters, which is the same unit used
to define sizes of data items, address occurrences (pointers), page
sizes, etc. The storage space requirement of a relation implementation
is the amount of storage space needed to store data item occurrences,
pointers, counter occurrences and privacy locks as required by the
implementation to establish the relation in the data base.

It is evident that one occurrence of each data item value of every
type must be stored somewhere in the data base. There are NI data item
types in SI. Size and cardinality of $I_i \in SI$, for
i = 1, 2, ..., NI, are given by ITSZ(i) and ITCR(i), respectively. The
total storage space required to store one occurrence of each data item
value of every type is denoted by MMSB and given by the following sum.

$$MMSB = \sum_{i=1}^{NI} ITSZ(i) * ITCR(i)$$
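As a quick numerical check of the sum, a minimal Python sketch follows. The function name and the list-based inputs are illustrative assumptions; the sizes and cardinalities in the usage note are made-up figures, not data from the report.

```python
def mmsb(itsz: list[int], itcr: list[int]) -> int:
    """Minimum storage: one occurrence of every value of every data item type.

    itsz[i] is the size in characters and itcr[i] the cardinality of the
    i-th data item type in SI.
    """
    if len(itsz) != len(itcr):
        raise ValueError("ITSZ and ITCR must have one entry per data item type")
    return sum(size * card for size, card in zip(itsz, itcr))
```

For two hypothetical item types of 4 and 10 characters with 100 and 20 distinct values, `mmsb([4, 10], [100, 20])` gives 4 * 100 + 10 * 20 = 600 characters.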

The storage cost of a relation implementation is defined to be the
amount of storage space required to implement the relation beyond the
amount of storage space required to store one occurrence of each data
item value of the relation's component data item types. For example,
the total storage requirement of implementation 1 for relation
$R_j(A, B)$ is
$\alpha'_j * \alpha''_j + \beta'_j * \beta''_j + \alpha'_j * M_j * \beta''_j$.
In this expression the first and the second terms give the amount of
storage required to store one occurrence of each data item value of
types A and B, respectively. The third term is the storage space
needed to store duplicate B-values as required by implementation 1
(see Section 3.1.1). The storage cost of implementation 1 for
$R_j(A, B)$ is $\alpha'_j * M_j * \beta''_j$; the first two terms are
accounted for in MMSB and are therefore deleted from storage costs.

The  amount  of  storage  left  for  optimization  is  then 
given  by  OMSB  where: 

OMSB  = TMSB  - MMSB 

The storage required for privacy locks is calculated on the
assumption that for every relation one lock will be defined in the
schema for each run unit that uses the relation, for every appropriate
access type. If the implementation of the relation $R_j$ requires the
definition of a set type and $RUUR_j$ is the number of run units that
use relation $R_j$, then $NATS * RUUR_j$ locks must be defined, where
NATS is the number of access types to a DBTG set. For a lock size of
PLOKS characters, then, this would require a total of
$TLS_j = PLOKS * NATS * RUUR_j$ characters of storage for privacy
locks. This formula can be used for implementations 9 through 20.
Implementation 21 for $R_j$ uses two data base sets, and privacy locks
are defined for the two data base sets separately at a total storage
requirement of $2 * TLS_j$ characters. Implementations 22 and 23 for
$R_j$ (defined in 3.1.12) require the definition of a data item type
that contains data base key values. The protection of the relation in
this case is achieved through the privacy locks defined for this data
item. The total storage required for privacy locks in the case of
implementations 22 and 23 for $R_j$ is then given by

$$TLD_j = PLOKS * NATD * RUUR_j$$

where NATD is the number of access types to a data item in a DBTG
system and PLOKS and $RUUR_j$ are defined above. Similarly,
implementation 24 for $R_j$ requires $2 * TLD_j$ characters for
privacy locks, because in this implementation two data items should be
defined in two different record types. We will also use the two
variables $TLS_j$ and $TLD_j$ in derivations of storage costs in this
section without further explanation of their expressions.
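The privacy lock rules above reduce to a small case analysis, sketched here in Python. The function name and argument order are illustrative assumptions. Returning zero for implementations 1 through 8 is also an assumption on our part: it is consistent with the storage cost formulas of this section, which carry no lock term for those implementations, but the text does not state it explicitly.

```python
def lock_storage(mpl: int, ploks: int, nats: int, natd: int, ruur: int) -> int:
    """Characters needed for privacy locks under implementation mpl.

    ploks: lock size, nats / natd: access types to a set / a data item,
    ruur: number of run units that use the relation (RUUR_j).
    """
    tls = ploks * nats * ruur  # TLS_j, locks on one DBTG set type
    tld = ploks * natd * ruur  # TLD_j, locks on one data-base-key item
    if 9 <= mpl <= 20:
        return tls
    if mpl == 21:
        return 2 * tls         # two data base sets
    if mpl in (22, 23):
        return tld
    if mpl == 24:
        return 2 * tld         # two data items in two record types
    if 1 <= mpl <= 8:
        return 0               # assumption: no set type or key item is defined
    raise ValueError("implementation number must be between 1 and 24")
```

With a hypothetical lock size of 2 characters, NATS = 4, NATD = 3 and five run units, TLS_j is 40 and TLD_j is 30 characters.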

The numbers of access types to a data base set and to a data item
(NATS and NATD, respectively) are the numbers of different DML
commands that a particular DBTG implementation provides for data base
sets and data items for which privacy locks may be defined. Privacy
locks may, for
example,  be  specified  which  apply  to  the  use  of  a DBTG  set 
type  for  DML  commands:  FIND,  ORDER,  INSERT  and  REMOVE  and 

which  apply  to  the  member  record  type  of  a set  type  for  DML 
commands:  FIND,  INSERT  and  REMOVE.  Privacy  locks  may  be 

specified  which  apply  to  data  item  types  for  DML  commands: 
STORE,  MODIFY  and  ERASE.  There  are  other  privacy  locks 
which  apply  to  the  uses  of  record  types,  schema,  sub-schema, 
realms,  etc.  which  we  are  not  taking  into  consideration 
either  because  we  are  not  modelling  the  underlying  concepts 
(such  as  privacy  locks  for  realms)  or  because  they  are 
irrelevant  to  the  optimization  problem  (such  as  privacy 
locks for record types). Privacy locks on data item types in SI are
also not considered because their storage requirements would be the
same for all implementations.


3.3.1  Storage  Costs  of  Fixed  Duplications 

The storage cost of implementation 1 for $R_j(A, B)$, which is the
fixed duplication of B under A (defined in 3.1.1), is the space
required to store the fixed sized vectors of B-values that we called
fd(B) in 3.1.1. The number of these vectors is $\alpha'_j$ and each
one has $M_j$ elements of size $\beta''_j$; hence the total storage
cost of implementation MPL = 1 for relation $R_j$ is given by:

$$SC(1, j) = \alpha'_j * M_j * \beta''_j$$

Similarly, the storage cost of implementation 2 for $R_j(A, B)$ is
given by:

$$SC(2, j) = \beta'_j * N_j * \alpha''_j$$


3.3.2  Storage  Costs  of  Variable  Duplications 
Implementations 3 and 4 for $R_j(A, B)$ were defined in 3.1.2 as
variable duplications (of B under A and A under B, respectively). The
storage cost of implementation 3 for $R_j(A, B)$ is the space required
to store the counter occurrences $c_i$ and variable-size vectors
$vd_i(B)$ of duplicate B-values, $i = 1, \ldots, \alpha'_j$. The
average number of elements of $vd_i(B)$ over
$i \in \{1, \ldots, \alpha'_j\}$ is $r_j / \alpha'_j$. The size of
each element is $\beta''_j$ characters. Therefore the storage cost of
implementation 3 for $R_j(A, B)$ is given by:

$$SC(3, j) = \alpha'_j * (r_j / \alpha'_j) * \beta''_j + \alpha'_j * CNTRS = r_j * \beta''_j + \alpha'_j * CNTRS$$

where CNTRS is the size of $c_i$.

Similarly the storage cost of implementation 4 for $R_j(A, B)$ is
given by the following.

$$SC(4, j) = r_j * \alpha''_j + \beta'_j * CNTRS$$
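The four duplication costs can be compared side by side with a small Python sketch. The parameter names are our own spelling of the report's symbols (`alpha_n` for the number of A-values, `alpha_sz` for the size of an A-value, and likewise for B), and the numbers in the usage note are invented for illustration only.

```python
def sc_duplication(mpl: int, *, alpha_n: int, alpha_sz: int, beta_n: int,
                   beta_sz: int, M: int, N: int, r: int, cntrs: int) -> int:
    """Storage cost SC(mpl, j) for the duplication implementations 1..4."""
    if mpl == 1:   # fixed duplication of B under A
        return alpha_n * M * beta_sz
    if mpl == 2:   # fixed duplication of A under B
        return beta_n * N * alpha_sz
    if mpl == 3:   # variable duplication of B under A, plus counters
        return r * beta_sz + alpha_n * cntrs
    if mpl == 4:   # variable duplication of A under B, plus counters
        return r * alpha_sz + beta_n * cntrs
    raise ValueError("this sketch covers implementations 1 through 4 only")
```

For a hypothetical relation with 100 A-values of 8 characters, 300 B-values of 12 characters, M = 5, N = 2, r = 400 instances and 2-character counters, the costs are 6000, 4800, 5000 and 3800 characters for implementations 1 to 4, so the variable forms win whenever the vectors are far from full.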

3.3.3  Storage  Costs  of  Fixed  Aggregations 

The storage cost of implementation 5 for $R_j(A, B)$, defined as the
fixed aggregation of B under A, is the space required to store the
$fa_i(B)$, $i = 1, \ldots, \alpha'_j$, vectors defined in 3.1.3.
However, since the occurrences of B-values in these vectors are shared
occurrences, the space required to store one occurrence of each
B-value, $\beta'_j * \beta''_j$, is accounted for in MMSB. The storage
cost of implementation 5 is then the difference between these two
storage space requirements:

$$SC(5, j) = \alpha'_j * M_j * \beta''_j - \beta'_j * \beta''_j = (\alpha'_j * M_j - \beta'_j) * \beta''_j$$

Similarly the storage cost of implementation 6 for $R_j(A, B)$ is:

$$SC(6, j) = (\beta'_j * N_j - \alpha'_j) * \alpha''_j$$

If relation $R_j(A, B)$ is of type 1 then $\beta'_j = \alpha'_j * M_j$,
in which case SC(5, j) = 0. Similarly SC(6, j) = 0 if $R_j(A, B)$ is
of type 2.

3.3.4  Storage  Costs  of  Variable  Aggregations 

Implementations 7 and 8 for $R_j(A, B)$ were defined, in 3.1.4, as
variable aggregations (of B under A and A under B, respectively). The
storage cost of implementation 7 for $R_j(A, B)$ is the space required
to store the counter occurrences $c_i$ and variable-size vectors
$va_i(B)$ of B-values, $i = 1, \ldots, \alpha'_j$. The average size
(number of elements) of $va_i(B)$ over $i = 1, \ldots, \alpha'_j$ is
$r_j / \alpha'_j$. The total space requirement of the $va_i(B)$
vectors (in characters) is, then,

$$\alpha'_j * (r_j / \alpha'_j) * \beta''_j = r_j * \beta''_j$$

However, as in the case of implementation 5, the occurrences of
B-values in the $va_i(B)$ vectors are shared occurrences. They are
accounted for in MMSB, and therefore we must subtract from the above
value the storage required to store one occurrence of each B-value,
which is equal to $\beta'_j * \beta''_j$. The total space requirement
of the $c_i$ counters is $\alpha'_j * CNTRS$. Hence the storage cost
of implementation 7 for $R_j(A, B)$ is:

$$SC(7, j) = (r_j - \beta'_j) * \beta''_j + \alpha'_j * CNTRS$$

Similarly, the storage cost of implementation 8 for $R_j(A, B)$ is as
follows.

$$SC(8, j) = (r_j - \alpha'_j) * \alpha''_j + \beta'_j * CNTRS$$

Note that if $R_j$ is of type 3 (4) then $r_j = \beta'_j$
($r_j = \alpha'_j$).
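The aggregation costs of 3.3.3 and 3.3.4 differ from the duplication costs only by the subtraction of the shared occurrences already counted in MMSB. A minimal Python sketch follows; the parameter names are our own spelling of the report's symbols and the sample figures are invented.

```python
def sc_aggregation(mpl: int, *, alpha_n: int, alpha_sz: int, beta_n: int,
                   beta_sz: int, M: int, N: int, r: int, cntrs: int) -> int:
    """Storage cost SC(mpl, j) for the aggregation implementations 5..8."""
    if mpl == 5:   # fixed aggregation of B under A
        return (alpha_n * M - beta_n) * beta_sz
    if mpl == 6:   # fixed aggregation of A under B
        return (beta_n * N - alpha_n) * alpha_sz
    if mpl == 7:   # variable aggregation of B under A, plus counters
        return (r - beta_n) * beta_sz + alpha_n * cntrs
    if mpl == 8:   # variable aggregation of A under B, plus counters
        return (r - alpha_n) * alpha_sz + beta_n * cntrs
    raise ValueError("this sketch covers implementations 5 through 8 only")
```

With 100 A-values of 8 characters, 300 B-values of 12 characters, M = 5, N = 2, r = 400 and 2-character counters, the costs are 2400, 4000, 1400 and 3000 characters. Setting M = 3 makes the number of B-values equal to 100 * 3, the type 1 situation, and SC(5, j) drops to zero as noted in 3.3.3.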

3.3.5  Storage  Costs  of  Chain  Associations 

Implementations 9 through 16 were defined in 3.1.5 through 3.1.8 as
different types of chain associations. They all use a single data base
set type to implement a relation. In the derivations of the storage
costs of implementations 9 through 16, which follow, PTRS denotes the
size of a data-base-key (pointer size) and $TLS_j$ is as defined at
the beginning of this section.

The storage cost of implementation 9 for $R_j(A, B)$, called the
chain with next pointers association of B under A, consists of the
storage required for the $mp_i$ and $np_k$ pointers (see 3.1.5) for
$i = 1, \ldots, \alpha'_j$ and $k = 1, \ldots, \beta'_j$. Therefore,
with a pointer size of PTRS and a privacy lock requirement of $TLS_j$,
the storage cost of implementation 9 for $R_j(A, B)$ is given by:

$$SC(9, j) = \alpha'_j * PTRS + \beta'_j * PTRS + TLS_j = (\alpha'_j + \beta'_j) * PTRS + TLS_j$$

Similarly the storage cost of implementation 10 for $R_j(A, B)$ is
given by:

$$SC(10, j) = (\alpha'_j + \beta'_j) * PTRS + TLS_j$$

For implementation 11 (for $R_j(A, B)$), which is the chain with next
and owner pointers association of B under A, defined in 3.1.6, the
storage cost is due to the storage space requirement of the three sets
of pointers $mp_i$, $np_k$ and $op_k$ ($i = 1, \ldots, \alpha'_j$ and
$k = 1, \ldots, \beta'_j$). The storage requirements of the $mp_i$,
$np_k$ and $op_k$ pointers are $\alpha'_j * PTRS$, $\beta'_j * PTRS$
and $\beta'_j * PTRS$ characters, respectively. Hence the storage cost
of implementation 11 for $R_j(A, B)$ is as follows.

$$SC(11, j) = (\alpha'_j + 2 * \beta'_j) * PTRS + TLS_j$$

Similarly the storage cost of implementation 12 for $R_j(A, B)$ is
given by:

$$SC(12, j) = (2 * \alpha'_j + \beta'_j) * PTRS + TLS_j$$

In the case of implementation 13 for $R_j(A, B)$, which is the chain
with next and prior pointers association of B under A, defined in
3.1.7, there are four sets of pointers to consider. These sets of
pointers were defined as $mp_i$, $\ell p_i$, $np_k$ and $pp_k$,
$i = 1, \ldots, \alpha'_j$, $k = 1, \ldots, \beta'_j$. The storage
requirements of these sets are $\alpha'_j * PTRS$, $\alpha'_j * PTRS$,
$\beta'_j * PTRS$ and $\beta'_j * PTRS$, respectively. The storage
cost of implementation 13 for $R_j(A, B)$ is, then, given by

$$SC(13, j) = 2 * (\alpha'_j + \beta'_j) * PTRS + TLS_j$$

Similarly the storage cost of implementation 14 for $R_j(A, B)$ is as
follows.

$$SC(14, j) = 2 * (\alpha'_j + \beta'_j) * PTRS + TLS_j$$

In evaluating the storage cost of implementation 15 for $R_j(A, B)$,
defined in 3.1.8, there are five sets of pointers to be considered.
The first two sets are the $mp_i$ and $\ell p_i$,
$i = 1, \ldots, \alpha'_j$, pointers in owner records, and the next
three are the $np_k$, $pp_k$ and $op_k$, $k = 1, \ldots, \beta'_j$,
pointers in member records. The total storage cost of implementation
15 for $R_j(A, B)$ is, then, given by:

$$SC(15, j) = (3 * \beta'_j + 2 * \alpha'_j) * PTRS + TLS_j$$

Similarly, the storage cost of implementation 16 for $R_j(A, B)$ is:

$$SC(16, j) = (3 * \alpha'_j + 2 * \beta'_j) * PTRS + TLS_j$$
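Since all eight chain associations share the form (a * α' + b * β') * PTRS + TLS, they can be tabulated by their pointer multipliers. The sketch below does this in Python; the table name `CHAIN_POINTERS` and the parameter names are illustrative assumptions, and the sample figures are invented.

```python
# Pointers per A-value record and per B-value record for implementations 9..16,
# read off the SC(9, j) .. SC(16, j) formulas above.
CHAIN_POINTERS = {
    9: (1, 1), 10: (1, 1),
    11: (1, 2), 12: (2, 1),
    13: (2, 2), 14: (2, 2),
    15: (2, 3), 16: (3, 2),
}


def sc_chain(mpl: int, *, alpha_n: int, beta_n: int, ptrs: int, tls: int) -> int:
    """Storage cost SC(mpl, j) for the chain associations 9..16."""
    if mpl not in CHAIN_POINTERS:
        raise ValueError("this sketch covers implementations 9 through 16 only")
    per_a, per_b = CHAIN_POINTERS[mpl]
    return (per_a * alpha_n + per_b * beta_n) * ptrs + tls
```

With 100 A-values, 300 B-values, 4-character pointers and TLS_j = 40, implementation 9 costs 1640 characters, implementation 11 costs 2840 and implementation 15 costs 4440, which shows how quickly owner and prior pointers on the larger side add up.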

3.3.6  Storage  Costs  of  Pointer  Array  Associations 

Implementations 17, 18, 19 and 20 were defined in 3.1.9 and 3.1.10 as
pointer array associations. We will now present the storage costs of
these four implementations.

In implementation 17 for $R_j(A, B)$, which is the pointer array
association of B under A, the storage requirement for the $vp_i$,
$i = 1, \ldots, \alpha'_j$, pointers is $\alpha'_j * PTRS$, and the
pointer arrays $pa_i$, $i = 1, \ldots, \alpha'_j$, in owner records of
this implementation require $M_j * PTRS$ characters for each $pa_i$.
The total storage cost of implementation 17 for $R_j(A, B)$ is, then,
as follows.

$$SC(17, j) = (\alpha'_j * M_j + \alpha'_j) * PTRS + TLS_j$$

Similarly, the storage cost of implementation 18 for $R_j(A, B)$ is
given by:

$$SC(18, j) = (\beta'_j * N_j + \beta'_j) * PTRS + TLS_j$$

Implementation 19 for $R_j(A, B)$, called the pointer array with
owner pointers association of B under A, in addition to the $vp_i$ and
$pa_i$, $i = 1, \ldots, \alpha'_j$, pointers, includes the owner
pointers $op_k$, $k = 1, \ldots, \beta'_j$, in its member records. The
total storage cost of implementation 19 for $R_j(A, B)$ is then as
follows.

$$SC(19, j) = (\alpha'_j * M_j + \alpha'_j + \beta'_j) * PTRS + TLS_j$$

Similarly the storage cost of implementation 20 for $R_j(A, B)$ is
given by:

$$SC(20, j) = (\beta'_j * N_j + \beta'_j + \alpha'_j) * PTRS + TLS_j$$
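A short Python sketch of the four pointer array costs follows. The parameter names are our own spelling of the report's symbols and the sample figures are invented for illustration.

```python
def sc_pointer_array(mpl: int, *, alpha_n: int, beta_n: int, M: int, N: int,
                     ptrs: int, tls: int) -> int:
    """Storage cost SC(mpl, j) for the pointer array associations 17..20."""
    if mpl == 17:   # pointer arrays pa_i plus vp_i pointers in owner records
        pointers = alpha_n * M + alpha_n
    elif mpl == 18:
        pointers = beta_n * N + beta_n
    elif mpl == 19:  # as 17, plus one owner pointer per member record
        pointers = alpha_n * M + alpha_n + beta_n
    elif mpl == 20:
        pointers = beta_n * N + beta_n + alpha_n
    else:
        raise ValueError("this sketch covers implementations 17 through 20 only")
    return pointers * ptrs + tls
```

With 100 A-values, 300 B-values, M = 5, N = 2, 4-character pointers and TLS_j = 40, the costs are 2440, 3640, 3640 and 4040 characters for implementations 17 to 20.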


3.3.7  Storage  Costs  of  Dummy  Record  Association  and  Linkages 
Implementation 21 for $R_j(A, B)$ was defined in 3.1.11 as the dummy
record association of A and B. In this implementation the two pointer
sets $mp_i$ and $mp_\ell$ ($i = 1, \ldots, \alpha'_j$ and
$\ell = 1, \ldots, \beta'_j$) require $\alpha'_j * PTRS$ and
$\beta'_j * PTRS$ characters of storage, respectively. Each dummy
record of this implementation requires 4 * PTRS characters and there
are $r_j$ such records. Also, since two data base set types are
defined for this implementation, the storage space required for
privacy locks is $2 * TLS_j$, as discussed at the beginning of this
section. The total storage cost of implementation 21 for $R_j(A, B)$
is, then, given by:

$$SC(21, j) = (\alpha'_j + \beta'_j + 4 * r_j) * PTRS + 2 * TLS_j$$

Implementation 22 for $R_j(A, B)$, defined in 3.1.12, is the single
linkage of B under A. The storage cost of this implementation consists
of the space required to store the $kp_i$ vectors of pointers and the
$TLD_j$ characters for privacy locks discussed at the beginning of
this section. Each $kp_i$, $i = 1, \ldots, \alpha'_j$, has $M_j$
elements of size PTRS characters, so the total storage cost of
implementation 22 for $R_j(A, B)$ is given by:

$$SC(22, j) = \alpha'_j * M_j * PTRS + TLD_j$$

Similarly the storage costs of implementations 23 and 24 for
$R_j(A, B)$ (defined in 3.1.12) are given by the following
expressions.

$$SC(23, j) = \beta'_j * N_j * PTRS + TLD_j$$

$$SC(24, j) = (\alpha'_j * M_j + \beta'_j * N_j) * PTRS + 2 * TLD_j$$
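The last four storage costs can be sketched in the same style. The parameter names are our own spelling of the report's symbols, and the sample figures in the usage note are invented.

```python
def sc_dummy_and_linkage(mpl: int, *, alpha_n: int, beta_n: int, M: int, N: int,
                         r: int, ptrs: int, tls: int, tld: int) -> int:
    """Storage cost SC(mpl, j) for implementations 21..24."""
    if mpl == 21:   # dummy records: 4 pointers each, two set types
        return (alpha_n + beta_n + 4 * r) * ptrs + 2 * tls
    if mpl == 22:   # single linkage of B under A
        return alpha_n * M * ptrs + tld
    if mpl == 23:   # single linkage of A under B
        return beta_n * N * ptrs + tld
    if mpl == 24:   # linkage in both directions, two key items
        return (alpha_n * M + beta_n * N) * ptrs + 2 * tld
    raise ValueError("this sketch covers implementations 21 through 24 only")
```

With 100 A-values, 300 B-values, M = 5, N = 2, r = 400 instances, 4-character pointers, TLS_j = 40 and TLD_j = 30, the costs are 8080, 2030, 2430 and 4460 characters. The dummy record form is the most expensive here because its cost grows with the number of relation instances rather than with the number of distinct values.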


3.4  Time  Costs 

Our basic assumption is that the data base, residing on secondary
storage, is accessed in a paging environment. Only a certain number of
pages (of data) can be in main storage at any given time. We assume
this number to be MSPG. When a record access is required, if the
record is stored in a page that is currently in main storage then no
secondary storage access is required. The record, in this case, is
simply moved from system buffers to the user working area according to
the rules of the data base management system. If, however, the
accessed record is stored in a page that is not in main storage at the
time of access, the entire page that contains the record must first be
read from secondary storage into the system buffers before it can be
delivered to the user program. For every record access, then, there is
a probability that a (missing) page fault occurs and a secondary
storage access is required. The expected rate of these secondary
storage accesses is what we are trying to minimize.

In this section we derive the expressions for computing the expected
rate of page faults for each relation implementation.

The time cost of implementation numbered MPL for relation $R_j$ is
defined to be the number of page faults expected to occur as a result
of execution of the run units specified in the user activities if
implementation MPL is used for relation $R_j$. This value will be
denoted by TC(MPL, j) and it is defined only if implementation
numbered MPL is acceptable for relation $R_j$, i.e., AC(MPL, j) = 1.
We recall that SR was defined in Chapter 2 as the set of data base
relations of the application, $SR = \{R_1, R_2, \ldots, R_{NR}\}$. A
configuration C for a set of data base relations SR is a set of
ordered pairs of relation implementations:

$$C = \{\langle R_1, MPL_1 \rangle, \langle R_2, MPL_2 \rangle, \ldots, \langle R_j, MPL_j \rangle, \ldots, \langle R_{NR}, MPL_{NR} \rangle\}$$

where $MPL_j$, $j = 1, \ldots, NR$, is an implementation number (an
integer from 1 to 24) designating one of the 24 implementation
alternatives defined in Section 3.1. Note that in the definition of C,
$R_i = R_j$ iff $i = j$, for $i, j \in \{1, 2, \ldots, NR\}$, but
$MPL_i$ may be equal to $MPL_j$ when $i \neq j$.

An acceptable configuration C for SR is defined as follows.

$$C = \{\langle R_j, MPL_j \rangle \mid j = 1, \ldots, NR \text{ and } AC(MPL_j, j) = 1 \text{ for all } j\}$$

The  (cumulative)  storage  cost  of  an  acceptable  config- 
uration C for  SR  is  the  sum  of  storage  costs  of  individual 
relation  implementations  in  the  configuration  and  it  is 
given  by: 

NR 

CS(c)  = 2 SC (MPL . , j) 

j-1  3 

Similarly,  the  (cumulative)  time  cost  of  an  acceptable  con- 
figuration C is  defined  as  follows: 

$$CT(c) = \sum_{j=1}^{NR} TC(MPL_j, j)$$

This  is  the  total  time  cost  of  a data  base,  measured  in  the 
expected  number  of  page  faults,  and  it  is  the  sum  of  the 
time  costs  of  exactly  one  acceptable  implementation  for 
each  relation  in  SR  (the  set  of  relations  defined  in  the 
application) . 




CT (c)  is  the  quantity  that  is  subject  to  minimization, 
in  the  optimization  algorithm  of  Chapter  4,  over  all  accep- 
table configurations  c for  which  CS(c)  is  not  greater  than 
a given  limit. 
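
The definitions of CS(c) and CT(c) can be illustrated with a minimal Python sketch. The tables AC, SC and TC below are hypothetical sample data keyed by (MPL, j), and the exhaustive enumeration only illustrates the constrained minimization; it is not the optimization algorithm of Chapter 4.

```python
from itertools import product


def acceptable_configurations(ac, num_relations):
    """Return every configuration (MPL_1, ..., MPL_NR) with AC(MPL_j, j) = 1 for all j."""
    choices = [
        [mpl for (mpl, j), ok in sorted(ac.items()) if j == rel and ok == 1]
        for rel in range(1, num_relations + 1)
    ]
    return product(*choices)


def cumulative_cost(cost, config):
    """Sum a per-relation cost table (SC or TC) over a configuration."""
    return sum(cost[(mpl, j)] for j, mpl in enumerate(config, start=1))


def best_configuration(ac, sc, tc, num_relations, storage_limit):
    """Brute force: minimise CT(c) over acceptable c with CS(c) <= storage_limit."""
    feasible = [
        c for c in acceptable_configurations(ac, num_relations)
        if cumulative_cost(sc, c) <= storage_limit
    ]
    return min(feasible, key=lambda c: cumulative_cost(tc, c), default=None)


# Hypothetical data: two relations, only implementations 1 and 9 considered.
AC = {(1, 1): 1, (9, 1): 1, (1, 2): 1, (9, 2): 0}
SC = {(1, 1): 500, (9, 1): 300, (1, 2): 400}
TC = {(1, 1): 12.0, (9, 1): 20.0, (1, 2): 7.5}
```

With these sample tables, a storage limit of 1000 admits both acceptable configurations and the cheaper one in page faults is chosen; tightening the limit to 800 leaves only the configuration that uses implementation 9 for the first relation.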

In order to compute the time cost of a relation implementation we have to accumulate the expected page fault rates of all operations (of all run units) in which the relation is used. Let $O_\ell^k$ be the $\ell$-th operation of the k-th run unit and let implementation MPL be acceptable for relation $R_j \in SR$, i.e., $AC(MPL, j) = 1$. We denote the expected number of page faults resulting from the execution of $O_\ell^k$, if implementation MPL is used for $R_j$, by $TC(MPL, j, O_\ell^k)$ and call it the time cost of implementation MPL for $R_j$ in $O_\ell^k$. The value of $TC(MPL, j, O_\ell^k)$ is undefined if $AC(MPL, j) = 0$ and it is zero if relation $R_j$ is not used in operation $O_\ell^k$.

Recalling that $NO_k$ denotes the number of operations in the k-th run unit and NU denotes the number of run units, the time costs for relation implementations can be computed from:

$$TC(MPL, j) = \sum_{k=1}^{NU} \sum_{\ell=1}^{NO_k} TC(MPL, j, O_\ell^k)$$


In the following we will present the formulas to compute $TC(MPL, j, O_\ell^k)$ for all implementations defined in 3.1. These formulas involve certain types of page fault probabilities which we will discuss first.
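
The accumulation of per-operation time costs over all run units can be sketched as follows. The operation tuples and the placeholder cost function are hypothetical; in the methodology the per-operation cost would come from the formulas of the following subsections.

```python
def time_cost(mpl, j, run_units, op_cost):
    """TC(MPL, j): sum of TC(MPL, j, O) over every operation of every run unit.

    run_units is a list of run units, each a list of operations.
    op_cost(mpl, j, op) returns TC(MPL, j, O) and must return 0 when
    relation j is not used in the operation.
    """
    return sum(op_cost(mpl, j, op) for unit in run_units for op in unit)


# Hypothetical operations: (op code, item sought, item given, relation index, frequency f).
RUN_UNITS = [
    [("FIND ALL", "B", "A", 1, 10), ("FIND NEXT", "B", "A", 2, 4)],
    [("STORE FIRST", "B", "A", 1, 2)],
]


def toy_cost(mpl, j, op):
    """Placeholder cost: 0.5 expected page faults per execution on relation j."""
    _, _, _, relation, freq = op
    return 0.5 * freq if relation == j else 0.0
```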




3.4.1  Page  Fault  Probability  Model 

We  define  four  types  of  one  step  page  fault  proba- 
bilities as  follows. 

$p_1$ is the probability of occurrence of a page fault when a record is accessed using the value of a data item in the record (data key value).

$p_2(R_j)$ is the probability of occurrence of a page fault when the next member record of a set in chain mode is accessed from the current member record of the set. $p_2(R_j)$ applies to chain associations of B under A for $R_j(A, B)$. $p_2'(R_j)$ is similarly defined for chain associations of A under B for $R_j(A, B)$.

$p_3(R_j)$ is the probability of occurrence of a page fault when a related record is accessed from another record in an association or linkage implementation for $R_j$. In association implementations $p_3(R_j)$ is the page fault probability for an owner to member (or member to owner) record access.

$p_4$ is the probability of occurrence of a page fault when a page that contains security information (values of privacy locks) is accessed.




Our  page  fault  probability  model  makes  certain  assump- 
tions which  we  wish  to  discuss  at  this  point.  It  is 
assumed  that  the  data  base  will  be  divided  into  subdivisions 
called  files.  Each  file  consists  of  all  records  of  only 
one  type  (homogeneous  files)  . No  physical  ordering  is 
assumed  for  records  of  a file.  Files  are  subdivided  into 
pages  of  PGSZ  characters.  Only  a certain  number  MSPG  of 
pages  of  data  can  be  in  main  storage  at  a given  time.  With 
pages of size PGSZ characters this accounts for a total of
PGSZ * MSPG characters of main storage which we call the
page buffer size PBSZ.

When  a record  is  accessed  through  its  data  key  value, 
the  value  of  a certain  data  item  type  is  used  (by  the 
system) to directly locate the record. For example, suppose relation $R_j(A, B)$ is implemented by a chain with next pointers association of B under A (implementation 9 defined in 3.1.5); then each A-value is stored in a record that is the owner of a set that contains the records in which B-values related to the A-value in $R_j$ are stored. Now for each execution of an operation such as $O = \langle \text{FIND ALL}, B, A, R_j, f\rangle$, all B-values related in $R_j$ to a specific A-value, say a, must be retrieved. The A-value
will  be  provided  at  execution  time  and  it  is  based  on  this 
value  that  the  system  will  directly  access  the  record  that 
contains  it. 


For this type of access no assumptions can be made regarding the contents of the system buffer, because we do not require that the f executions of the operation O be performed consecutively. For a direct record access, then, any part of the data base (of total size TMSB characters) can be in the buffer (whose size is PBSZ characters) at the time of access. The probability of the record that contains a specific A-value being in storage is (PBSZ/TMSB) and the page fault probability $p_1$ is as follows.

$$p_1 = 1 - \frac{PBSZ}{TMSB}$$

Note that $0 < p_1 < 1$ because we assume $0 < PBSZ < TMSB$.

The assumption that PBSZ is less than TMSB is a trivial one because otherwise the whole data base can be accommodated in the system buffer and any structure chosen for the data base would cost the same in terms of generating page faults. Another assumption, implicit in the above derivation, is that records of the data base are accessible without the requirement of a sequential scan of the data base or a sequential search of collections of records. In DBTG terms, the data item being used as data key value in a record must be declared as a CALC KEY in the record type.
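
As a small numerical illustration of the direct access probability, the following Python sketch evaluates 1 - PBSZ/TMSB. The function name and the sample sizes are illustrative assumptions, not values from the report.

```python
def p1(pbsz, tmsb):
    """Page fault probability for a direct (data key) record access.

    pbsz: page buffer size in characters (MSPG * PGSZ).
    tmsb: total data base size in characters.
    """
    if not 0 < pbsz < tmsb:
        raise ValueError("the model assumes 0 < PBSZ < TMSB")
    return 1.0 - pbsz / tmsb


# Hypothetical sizes: 10 buffer pages of 2000 characters, 1,000,000-character data base.
example = p1(10 * 2000, 1_000_000)
```

Under these assumed sizes only 2 percent of the data base fits in the buffer, so almost every direct access is expected to fault.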

We now proceed with the derivation of $p_2(R_j)$ and $p_2'(R_j)$ defined above. Referring back to the example operation used above for derivation of $p_1$, we note that when the record that contains the proper A-value, a, is retrieved, then a chain of member records must be retrieved. The first member record is accessed using a pointer in the owner record. This is an owner to member record access that we discuss later. But the rest of the record accesses needed to retrieve all other member records that contain the desired B-values are performed using the next member pointers in the member records. This is the type of access referred to in the definition of $p_2(R_j)$. Since we assumed that records of the same type are stored in one file, we have a smaller range of records stored in pages of a certain file with these accesses being consecutively performed. In this case, then, the accessed record may be in any of $\tau_j$ pages, where $\tau_j$ is the number of pages of the file that contains records of the desired type. It is assumed that all MSPG pages in main storage contain records of the member record type. This assumption may result in approximate computation of page fault probabilities, especially in the first few member record accesses, for some paging algorithms used by the system. The exact computation of these probabilities requires detailed modelling of paging policies, which is beyond the scope of this research, and some information about the final record and set structures. We will assume that the system will, in some way, fill the MSPG main storage pages with the records of the desired type when a chain of member records is being traversed. A similar assumption will be made when computing $p_3(R_j)$, namely that when a related record is accessed the MSPG pages of main storage are filled with records of the owner and member record types.


In case of relation $R_j(A, B)$ where $B = DST(R_j)$ and $\beta'_j = |B|$, there are $\beta'_j$ unique B-values stored in $\beta'_j$ records, and therefore $\tau_j = (\beta'_j * ARL)/PGSZ$ where ARL is the Average Record Length and PGSZ is the size of a page. $p_2(R_j)$ is then given by the following.

$$p_2(R_j) = 1 - \frac{MSPG}{\tau_j} = 1 - \frac{MSPG * PGSZ}{\beta'_j * ARL} = 1 - \frac{PBSZ}{\beta'_j * ARL}$$

The above formula holds as long as $PBSZ < \beta'_j * ARL$. For $PBSZ \ge \beta'_j * ARL$ we note that the entire file that contains the records of the desired type can be accommodated in the page buffer and therefore in this case we set $p_2(R_j)$ equal to zero.

$$p_2(R_j) = \begin{cases} 1 - \dfrac{PBSZ}{\beta'_j * ARL} & \text{if } PBSZ < \beta'_j * ARL \\ 0 & \text{otherwise} \end{cases}$$


Similarly $p_2'(R_j)$, which is the page fault probability when the next record in a chain association of A under B for $R_j(A, B)$ is accessed, is given by the following.

$$p_2'(R_j) = \begin{cases} 1 - \dfrac{PBSZ}{\alpha'_j * ARL} & \text{if } PBSZ < \alpha'_j * ARL \\ 0 & \text{otherwise,} \end{cases}$$

where $\alpha'_j = |ORG(R_j)|$.

The above formulas for $p_2(R_j)$ and $p_2'(R_j)$ hold for all association implementations except implementation 21 for $R_j(A, B)$ where ARL and the $\alpha'_j$ or $\beta'_j$ parameters are different. This case will be discussed later.
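
The chain traversal probability can be sketched in Python as below. One function serves both cases: the record count passed in is the number of B-records for a chain of B under A, or the number of A-records for a chain of A under B. The function name and sample figures are illustrative assumptions.

```python
def chain_fault_probability(pbsz, num_records, arl):
    """Fault probability for a next-member access within one homogeneous file.

    pbsz: page buffer size in characters.
    num_records: number of records in the member file.
    arl: average record length in characters.
    """
    file_size = num_records * arl
    if pbsz >= file_size:
        # The whole member file fits in the page buffer.
        return 0.0
    return 1.0 - pbsz / file_size


# Hypothetical case: 20,000-character buffer, 1000 member records of 50 characters.
partly_resident = chain_fault_probability(20_000, 1000, 50)
fully_resident = chain_fault_probability(20_000, 300, 50)
```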

The next category of page fault probabilities to be discussed is the one defined as $p_3(R_j)$, or the probability of occurrence of a page fault when a related record is accessed. In a chain or a pointer array association implementation, when an owner record is accessed using an owner pointer in a member record or a member record is accessed using a member pointer in an owner record we say that a related record has been accessed. An access made to records related by linkage implementations using pointer data items of the implementation is also called a related record access. As in the case of $p_2(R_j)$ discussed above, $p_3(R_j)$ applies only to association and linkage implementations of $R_j(A, B)$. These implementations for $R_j(A, B)$ all require unique A and B-values to be stored in different records. The number of record occurrences involved is therefore known and using ARL as before the sizes of the associated files can be estimated. The record types that contain data item types A and B have $\alpha'_j$ and $\beta'_j$ records, respectively, where $\alpha'_j = |ORG(R_j)|$ and $\beta'_j = |DST(R_j)|$. Sizes of the associated files are then $\alpha'_j * ARL$ and $\beta'_j * ARL$ characters respectively, where ARL is the Average Record Length. Assuming that any parts of these two files may be in main storage at the time of access, the formula for $p_3(R_j)$ is given by the following.

$$p_3(R_j) = \begin{cases} 1 - \dfrac{PBSZ}{(\alpha'_j + \beta'_j) * ARL} & \text{if } PBSZ < (\alpha'_j + \beta'_j) * ARL \\ 0 & \text{otherwise,} \end{cases}$$

where again PBSZ = MSPG * PGSZ is the size of the page buffer in characters.
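
A minimal sketch of the related record probability follows, assuming the two files (owner and member record types) together compete for the buffer as described above. The function name and sample figures are hypothetical.

```python
def related_fault_probability(pbsz, alpha, beta, arl):
    """Fault probability for an owner-to-member or member-to-owner access.

    alpha, beta: numbers of records holding A-values and B-values.
    arl: average record length in characters.
    """
    combined = (alpha + beta) * arl
    if combined <= 0:
        raise ValueError("the two files must contain at least one record")
    # max(...) covers the case in which both files fit in the buffer.
    return max(1.0 - pbsz / combined, 0.0)


# Hypothetical case: 600 A-records and 1000 B-records of 50 characters each.
sample = related_fault_probability(20_000, 600, 1000, 50)
```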

Finally for $p_4$, the probability of occurrence of a page fault when a page that contains security information is accessed, we use an average value based on the assumption that one page of main storage (PGSZ characters) is devoted to security information. Let LKSZ be the maximum amount of storage needed for security locks over all possible configurations. If a configuration needs S bytes then this probability is

$$p = 1 - \frac{PGSZ}{S} \ \text{ if } PGSZ \le S \quad \text{and} \quad p = 0 \ \text{ if } PGSZ > S,$$

where $0 \le S \le LKSZ$. The average value of p, denoted by $p_4$, assuming S varies continuously from zero to LKSZ, is

$$p_4 = 1 - \frac{PGSZ}{LKSZ}\left[1 + \ln\frac{LKSZ}{PGSZ}\right] \quad \text{for } PGSZ \le LKSZ.$$

For example, for LKSZ = 2400 characters and PGSZ = 2000 we have $p_4 \approx 0.01$, and for the same LKSZ and PGSZ = 4000, $p_4 = 0$.
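
The two numerical examples can be checked with a short Python sketch. It assumes that the per-configuration probability is 1 - PGSZ/S when S is at least PGSZ and zero otherwise, which is the form whose average over S from zero to LKSZ yields a closed expression with a logarithm; this reading of the formula is an assumption of the sketch.

```python
import math


def p4(pgsz, lksz):
    """Average lock-page fault probability, S uniform on (0, LKSZ]."""
    if pgsz >= lksz:
        # One page always holds all the locks, so no lock access faults.
        return 0.0
    return 1.0 - (pgsz / lksz) * (1.0 + math.log(lksz / pgsz))


small_page = p4(2000, 2400)   # about 0.015, i.e. 0.01 to two decimals
large_page = p4(4000, 2400)   # 0.0
```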

Page fault probabilities $p_2(R_j)$, $p_2'(R_j)$ and $p_3(R_j)$ as derived above do not apply to implementation 21. In the case of this implementation for $R_j(A, B)$ we note that for a related record access ($p_3(R_j)$) there are three files of records whose pages will be competing for the MSPG available pages in main storage. The first two files are associated with the record types in which A and B are defined, respectively. These two files, respectively, have $\alpha'_j$ and $\beta'_j$ records of average length ARL. The third file contains the $r_j$ dummy records of size DRSZ of implementation 21. In our definition of implementation 21, the value of DRSZ is given by DRSZ = 4 * PTRS, where PTRS is the size of an address occurrence (pointer size). This is because each dummy record, as defined in 3.1.11, consists of four address occurrences. The total size of these three files is then $(\alpha'_j + \beta'_j) * ARL + r_j * DRSZ$ and $p_3(R_j)$ is given by:

$$p_3(R_j) = \max\left[1 - \frac{PBSZ}{(\alpha'_j + \beta'_j) * ARL + r_j * DRSZ},\ 0\right]$$

Similarly, for $p_2(R_j)$ and $p_2'(R_j)$ in the case of implementation 21 for $R_j(A, B)$ only the third file discussed above, namely the file of $r_j$ dummy records, must be considered. $p_2(R_j)$ and $p_2'(R_j)$ are equal in this case because there is only one file of dummy records of total size $r_j * DRSZ$ and they are given by:

$$p_2(R_j) = p_2'(R_j) = \max\left[1 - \frac{PBSZ}{r_j * DRSZ},\ 0\right].$$

In the above formula, $r_j * DRSZ$ is the size of the file of dummy records of implementation 21 for $R_j(A, B)$ and PBSZ is the page buffer size as before. The above probabilities are zero if the entire file can be stored in the buffer, i.e., if $PBSZ \ge r_j * DRSZ$.
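
The special case of implementation 21 can be sketched as follows. The function name, argument names and sample figures are illustrative assumptions; the dummy record size is taken as four pointers, as stated in the text.

```python
def impl21_probabilities(pbsz, alpha, beta, arl, r, ptrs):
    """Return (p2, p3) for implementation 21.

    alpha, beta: numbers of records holding A-values and B-values.
    r: number of dummy records; ptrs: pointer size in characters.
    """
    drsz = 4 * ptrs                      # each dummy record holds four addresses
    dummy_file = r * drsz
    all_files = (alpha + beta) * arl + dummy_file
    p2 = max(1.0 - pbsz / dummy_file, 0.0)   # chain steps touch only dummy records
    p3 = max(1.0 - pbsz / all_files, 0.0)    # related access competes with all three files
    return p2, p3


# Hypothetical case: 5000 dummy records with 4-character pointers.
p2_21, p3_21 = impl21_probabilities(20_000, 600, 1000, 50, 5000, 4)
```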

The values of PBSZ, DRSZ, $\alpha'_j$, $\beta'_j$ and $r_j$ used in computing the page fault probabilities are fixed known values. However, values of ARL and LKSZ are estimated values. We will discuss later how these values should be estimated and how sensitive the results are with respect to errors in the estimation. We now proceed with the derivations of time cost formulas for each of the 24 defined implementations for the typical relation $R_j(A, B)$, where again

$A = ORG(R_j)$, $B = DST(R_j)$,

$\alpha'_j = |A|$, $\beta'_j = |B|$,

$\bar{m}_j$ = average number of B-values related to one A-value in $R_j$,

$\bar{n}_j$ = average number of A-values related to one B-value in $R_j$.

We also recall that SOP denotes the set of operation codes defined as follows: SOP = {FIND FIRST, FIND LAST, FIND ITH, FIND NEXT, FIND PREV., FIND ALL, FIND KEY, STORE FIRST, STORE LAST, STORE KEY, MODIFY, ERASE}.

3.4.2  Time  Cost  of  Implementations  1,  3,  5 and  7 

Implementations 1, 3, 5 and 7 for $R_j(A, B)$, as mentioned before, store all B-values related to a specific A-value, say a, in the record in which a is stored. This implies that once the record that contains a is accessed all B-values related to it in $R_j(A, B)$ are available without any more record accesses required. For example, an operation such as $\langle \text{FIND ALL}, B, A, R_j(A, B), f\rangle$ in a run unit will require f record accesses. Each record access results in a page fault with probability $p_1$, giving an expected page fault rate of $f * p_1$.

The same formula holds for all operation codes $op \in SOP$. On the other hand, for an operation such as $\langle \text{FIND ALL}, A, B, R_j(A, B), f\rangle$, all $\bar{n}_j$ A-values related to a specific B-value are to be accessed. The total record access rate is $f * \bar{n}_j$ and the expected page fault rate is $f * \bar{n}_j * p_1$. For other operations, namely operations such as $\langle op, A, B, R_j(A, B), f\rangle$ with $op \in SOP - \{\text{FIND ALL}\}$, the expected page fault rate is $f * p_1$. As far as the privacy decisions are concerned, two locks (one for each of the data item types A and B) must be accessed and checked at every operation execution with each access having a probability $p_4$ of resulting in a page fault.

The overall time cost of the above implementations (i.e., MPL = 1, 3, 5, 7) for $R_j(A, B)$ in $O_\ell^k$ (the $\ell$-th operation of the k-th run unit) is given by the following.

$$TC(MPL, j, O_\ell^k) = \begin{cases}
f_\ell^k\,(p_1 + 2p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\,(\bar{n}_j\, p_1 + 2p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op = \text{FIND ALL} \\
f_\ell^k\,(p_1 + 2p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in SOP - \{\text{FIND ALL}\} \\
0 & \text{otherwise.}
\end{cases}$$
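
The case analysis for implementations 1, 3, 5 and 7 can be written as a small Python function. The argument names `sought` and `given` (for the second and third components of the operation) are labels introduced here for illustration, and the probabilities in the sample calls are hypothetical.

```python
def tc_embedded_b_values(op, sought, given, f, n_bar, p1, p4):
    """TC(MPL, j, O) for MPL in {1, 3, 5, 7}, where O = <op, sought, given, R_j, f>.

    n_bar: average number of A-values related to one B-value.
    """
    if (sought, given) == ("B", "A"):
        # One direct access yields every related B-value.
        return f * (p1 + 2 * p4)
    if (sought, given) == ("A", "B"):
        accesses = n_bar if op == "FIND ALL" else 1
        return f * (accesses * p1 + 2 * p4)
    return 0.0


# Hypothetical values: f = 10 executions, n_bar = 3, p1 = 0.98, p4 = 0.01.
all_b = tc_embedded_b_values("FIND ALL", "B", "A", 10, 3, 0.98, 0.01)
all_a = tc_embedded_b_values("FIND ALL", "A", "B", 10, 3, 0.98, 0.01)
one_a = tc_embedded_b_values("FIND KEY", "A", "B", 10, 3, 0.98, 0.01)
```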


3.4.3  Time  Cost  of  Implementations  2,  4,  6 and  8 

Implementations 2, 4, 6 and 8 for $R_j(A, B)$, as described before, require that all A-values related to one B-value, say b, be stored in the record in which b is stored. The time costs of these implementations, in terms of the expected page fault rates, can be measured in a manner analogous to those in Section 3.4.2. So for the above implementations (i.e., for MPL = 2, 4, 6, 8) for $R_j(A, B)$ the time cost in operation $O_\ell^k$ is given by the following.

$$TC(MPL, j, O_\ell^k) = \begin{cases}
f_\ell^k\,(p_1 + 2p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\,(\bar{m}_j\, p_1 + 2p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op = \text{FIND ALL} \\
f_\ell^k\,(p_1 + 2p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in SOP - \{\text{FIND ALL}\} \\
0 & \text{otherwise.}
\end{cases}$$

3.4.4  Time  Costs  of  Implementations  9 and  10 

As defined in Section 3.1.5, implementation 9 for $R_j(A, B)$ is called the chain with next pointers association of B under A, in which a set type in chain mode establishes the relationships in $R_j$. We now present time costs of implementation 9 for $R_j$ for different operations.

First consider operations such as $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$. If $op \in \{\text{FIND FIRST}, \text{STORE FIRST}\}$ then for every operation execution the owner record of a set should be accessed based on its data key value (the A-value presented at execution time) at a cost of $p_1$ and then the first member record of the set must be accessed at a cost of $p_3(R_j)$. There will be three privacy locks to be accessed (locks on A, B and $R_j$) at a cost of $3p_4$. The total cost of implementation 9 for $R_j$ in $O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle$ is then

$$TC(9, j, O_\ell^k) = f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) \quad \text{for } op \in \{\text{FIND FIRST}, \text{STORE FIRST}\}.$$

Similarly for $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ and $op \in \{\text{FIND LAST}, \text{STORE LAST}, \text{FIND ALL}\}$, accessing the owner record costs $p_1$, accessing the first member record costs $p_3(R_j)$ and then $(\bar{m}_j - 1)$ member records must be traversed to hit the last member record, which costs $(\bar{m}_j - 1)\,p_2(R_j)$. Again there are three locks to be accessed at a cost of $3p_4$. The total time cost is then given by the following formula.

$$TC(9, j, O_\ell^k) = f_\ell^k\,(p_1 + p_3(R_j) + (\bar{m}_j - 1)\,p_2(R_j) + 3p_4)$$

for $O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle$ and $op \in \{\text{FIND LAST}, \text{STORE LAST}, \text{FIND ALL}\}$.

For an operation of the form $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ where op = FIND NEXT, at each execution of the operation the next B-value related in $R_j$ to a specific value a of type A is to be retrieved. The assumption here is that the owner record has been accessed and a current member record has been established in a previous operation, and now the next member record with respect to this current one is to be retrieved. This record access would cost $p_2(R_j)$ for every execution of the operation.

The locks needed to be accessed for privacy decisions are those for the relation $R_j$ and data item type B at a total cost of $2p_4$.

The overall cost of implementing $R_j$ by implementation 9 for operation $O_\ell^k = \langle \text{FIND NEXT}, B, A, R_j(A, B), f_\ell^k\rangle$ is as follows.

$$TC(9, j, O_\ell^k) = f_\ell^k\,(p_2(R_j) + 2p_4)$$

We now consider an operation such as $O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j(A, B), f_\ell^k\rangle$. This operation requires the retrieval of the "prior" B-value related to a specific A-value (like a). Here again we assume that a certain member record has been retrieved and established as the current member record of the set in a previous operation, and now the prior member record with respect to the current one is to be retrieved. Recall that in implementation 9 only next pointers are provided. Retrieval of a prior member record, then, requires i) the traversal of the chain of pointers in the records that follow the current one until the last one is hit, ii) retrieving the owner record, iii) retrieving the first record in the chain and iv) the traversal of the chain of pointers from the first member record to the one prior to the current record. The traversals in i) and iv) require a total of $(\bar{m}_j - 2)$ record retrievals at a cost of $p_2(R_j)$ per retrieval, and the record retrievals of ii) and iii) cost $p_3(R_j)$ each. The cost of retrieving the privacy locks will be $3p_4$.

So for $O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j(A, B), f_\ell^k\rangle$ we have the following.

$$TC(9, j, O_\ell^k) = f_\ell^k\,(p_3(R_j) + p_3(R_j) + (\bar{m}_j - 2)\,p_2(R_j) + 3p_4)$$

The time cost of implementation 9 for $R_j(A, B)$ in $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ will now be given for $op \in \{\text{FIND ITH}, \text{FIND KEY}, \text{STORE KEY}, \text{MODIFY}, \text{ERASE}\}$.

For op = FIND ITH the i-th B-value related (in $R_j$) to a specific A-value is to be retrieved. Since implementation 9 is a chain with next pointer association, this means that in a certain set in the data base the i-th member record is to be retrieved. The value of i will be provided at execution time and it can be any value from 1 to $\bar{m}_j$ with equal probability, and so on the average it is $\bar{m}_j(\bar{m}_j + 1)/(2\bar{m}_j) = 1 + (\bar{m}_j - 1)/2$. In order to retrieve the i-th member record, then, it is necessary to access the owner of the set based on its data key value at a cost of $p_1$, then access the first member record at a cost of $p_3(R_j)$ and finally traverse a chain of $(\bar{m}_j - 1)/2$ pointers at a cost of $((\bar{m}_j - 1)/2)\,p_2(R_j)$. Again accessing the three locks required for privacy decisions would cost $3p_4$. Similar arguments can be given for $op \in \{\text{FIND KEY}, \text{STORE KEY}, \text{MODIFY}, \text{ERASE}\}$ and hence we have the following.

$$TC(9, j, O_\ell^k) = f_\ell^k\left(p_1 + p_3(R_j) + \frac{\bar{m}_j - 1}{2}\,p_2(R_j) + 3p_4\right)$$

for $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ and $op \in \{\text{FIND ITH}, \text{FIND KEY}, \text{STORE KEY}, \text{MODIFY}, \text{ERASE}\}$.




We have presented so far the time costs of implementation 9 for $R_j(A, B)$ in all possible operations of type $O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle$. We now consider operations of type $O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle$.

We recall that implementation 9 for $R_j(A, B)$ is the chain with next pointers association of B under A (as defined in 3.1.5) in which A-values are stored in owner records of some set type. The operations of the above form (i.e., $\langle op, A, B, R_j, f_\ell^k\rangle$) are then basically owner record accesses from some member record of a set. Since all member records of a set are related to only one owner record, the time cost of implementation 9 for $R_j$ is the same for all $op \in SOP$. This time cost consists of i) a member record access through its data key value, ii) traversal of a chain of pointers in the member records and iii) accessing the owner record from the last member record. For every operation execution the record accesses in i) and iii) cost $p_1$ and $p_3(R_j)$ (in expected page fault rate), respectively. The member record accessed in i) is, on the average, located half way down the chain of $\bar{m}_j$ member records and hence the traversals in ii) would cost $\frac{1}{2}(\bar{m}_j - 1)\,p_2(R_j)$ in expected page fault rate. And as before privacy lock accesses, for locks on A, B and $R_j$, will cost $3p_4$.

The total cost of implementation 9 for $R_j$ in operations of type $O_\ell^k = \langle op, A, B, R_j(A, B), f_\ell^k\rangle$, $op \in SOP$, is given by the following.

$$TC(9, j, O_\ell^k) = f_\ell^k\left(p_1 + \frac{\bar{m}_j - 1}{2}\,p_2(R_j) + p_3(R_j) + 3p_4\right)$$



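Following is a summary of formulas derived in this subsection for time costs of implementation 9 for $R_j(A, B)$ in $O_\ell^k$.

$$TC(9, j, O_\ell^k) = \begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST}, \text{STORE FIRST}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{m}_j - 1)\,p_2(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND LAST}, \text{STORE LAST}, \text{FIND ALL}\} \\
f_\ell^k\,(p_2(R_j) + 2p_4) & \text{if } O_\ell^k = \langle \text{FIND NEXT}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\,(2p_3(R_j) + (\bar{m}_j - 2)\,p_2(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\left(p_1 + p_3(R_j) + \frac{\bar{m}_j - 1}{2}\,p_2(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH}, \text{FIND KEY}, \text{STORE KEY}, \text{MODIFY}, \text{ERASE}\} \\
f_\ell^k\left(p_1 + \frac{\bar{m}_j - 1}{2}\,p_2(R_j) + p_3(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}$$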
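
The summary for implementation 9 translates directly into a lookup function. The sketch below is illustrative: the argument names are introduced here, the probabilities in the sample calls are hypothetical, and the FIND FIRST / STORE FIRST case charges three lock accesses, following the derivation given earlier in this subsection.

```python
def tc_impl9(op, sought, given, f, m_bar, p1, p2, p3, p4):
    """TC(9, j, O) for O = <op, sought, given, R_j, f>.

    m_bar: average number of B-values related to one A-value.
    p2, p3: chain-step and related-record fault probabilities for R_j.
    """
    half_chain = (m_bar - 1) / 2
    if (sought, given) == ("A", "B"):
        # Member by key, walk to the end of the chain, then follow it to the owner.
        return f * (p1 + half_chain * p2 + p3 + 3 * p4)
    if (sought, given) != ("B", "A"):
        return 0.0
    if op in ("FIND FIRST", "STORE FIRST"):
        return f * (p1 + p3 + 3 * p4)
    if op in ("FIND LAST", "STORE LAST", "FIND ALL"):
        return f * (p1 + p3 + (m_bar - 1) * p2 + 3 * p4)
    if op == "FIND NEXT":
        return f * (p2 + 2 * p4)
    if op == "FIND PREV.":
        return f * (2 * p3 + (m_bar - 2) * p2 + 3 * p4)
    if op in ("FIND ITH", "FIND KEY", "STORE KEY", "MODIFY", "ERASE"):
        return f * (p1 + p3 + half_chain * p2 + 3 * p4)
    return 0.0


# Hypothetical parameters: chains of 5 members, lock page always resident (p4 = 0).
ARGS = dict(f=1, m_bar=5, p1=0.9, p2=0.5, p3=0.8, p4=0.0)
```

With these assumed values a FIND PREV. costs about 3.1 expected page faults against 0.5 for a FIND NEXT, which shows how expensive backward movement is when only next pointers are stored.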

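As described in 3.1.5, implementation 10 for $R_j(A, B)$ is the chain with next pointer association of A under B. Since this implementation is the converse of 9, derivations of time cost equations for it are analogous to those presented above for implementation 9 and so the details will not be repeated. The time cost of implementation 10 for $R_j(A, B)$ in operation $O_\ell^k$ is as follows.

$$TC(10, j, O_\ell^k) = \begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST}, \text{STORE FIRST}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{n}_j - 1)\,p_2'(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND LAST}, \text{STORE LAST}, \text{FIND ALL}\} \\
f_\ell^k\,(p_2'(R_j) + 2p_4) & \text{if } O_\ell^k = \langle \text{FIND NEXT}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\,(2p_3(R_j) + (\bar{n}_j - 2)\,p_2'(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND PREV.}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\left(p_1 + p_3(R_j) + \frac{\bar{n}_j - 1}{2}\,p_2'(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH}, \text{FIND KEY}, \text{STORE KEY}, \text{MODIFY}, \text{ERASE}\} \\
f_\ell^k\left(p_1 + \frac{\bar{n}_j - 1}{2}\,p_2'(R_j) + p_3(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}$$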


3.4.5  Time  Cost  of  Implementations  11  and  12 

Implementation 11 for $R_j(A, B)$, defined in 3.1.6, is called the chain with next and owner pointers association of B under A. Time costs of this implementation differ from those of implementation 9 only where owner pointers (in member records) are used in a series of record accesses. The owner pointers are used in two cases. The first case is for an operation such as $O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j(A, B), f_\ell^k\rangle$ where the prior (with respect to the current) member record of the set (that implements $R_j$) is accessed. We assume that the current member record is, on the average, half way down the chain of member records. The above operation, then, requires one record access to the owner record of the set with a page fault probability of $p_3(R_j)$, one record access from the owner to the first member at a cost of $p_3(R_j)$ and $(\bar{m}_j - 1)/2$ record traversals in the chain of member records at a cost of $((\bar{m}_j - 1)/2)\,p_2(R_j)$.

The cost of accessing privacy locks is $3p_4$ as for implementation 9.

The second case in which time cost of implementation 11 is different from that of implementation 9 is for operations such as $\langle op, A, B, R_j(A, B), f_\ell^k\rangle$, for $op \in SOP$. In these types of operations, each execution of the operation requires one record access, through data key values, to the member record that contains the specific B-value and then an owner access using the owner pointer in the member record. These two record accesses cost $p_1$ and $p_3(R_j)$, respectively, in terms of expected page fault rate.

Again, accessing the privacy locks would cost $3p_4$.




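The overall time cost of implementation 11 for $R_j(A, B)$ in operation $O_\ell^k$ is as follows.

$$TC(11, j, O_\ell^k) = \begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST}, \text{STORE FIRST}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{m}_j - 1)\,p_2(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND LAST}, \text{STORE LAST}, \text{FIND ALL}\} \\
f_\ell^k\,(p_2(R_j) + 2p_4) & \text{if } O_\ell^k = \langle \text{FIND NEXT}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\left(2p_3(R_j) + \frac{\bar{m}_j - 1}{2}\,p_2(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\left(p_1 + p_3(R_j) + \frac{\bar{m}_j - 1}{2}\,p_2(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH}, \text{FIND KEY}, \text{STORE KEY}, \text{MODIFY}, \text{ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}$$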

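Implementation 12 for $R_j(A, B)$ was defined in 3.1.6 and it is the chain with next and owner pointers association of A under B. Time costs of this implementation are analogous to those for implementation 11, presented in 3.4.5.

Following is the time cost of implementation 12 for $R_j(A, B)$ in operation $O_\ell^k$.

$$TC(12, j, O_\ell^k) = \begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST}, \text{STORE FIRST}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{n}_j - 1)\,p_2'(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND LAST}, \text{STORE LAST}, \text{FIND ALL}\} \\
f_\ell^k\,(p_2'(R_j) + 2p_4) & \text{if } O_\ell^k = \langle \text{FIND NEXT}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\left(2p_3(R_j) + \frac{1}{2}(\bar{n}_j - 1)\,p_2'(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle \text{FIND PREV.}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\left(p_1 + p_3(R_j) + \frac{1}{2}(\bar{n}_j - 1)\,p_2'(R_j) + 3p_4\right) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH}, \text{FIND KEY}, \text{STORE KEY}, \text{MODIFY}, \text{ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}$$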

3.4.6  Time  Cost  of  Implementations  13  and  14 

Implementation  13  for  Rj (A,  B) , defined  in  3.1.7  is 
called  the  chain  with  prior  pointer  association  of  B 


under  A.  Time  costs  of  this  implementation  differ  from 
those  of  implementation  9 in  two  cases  where  the  prior 


147 


pointers  are  used  in  a series  of  record  accesses. 

The first case in which a prior pointer is used is an operation of the form $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ where $op \in \{\text{FIND LAST, STORE LAST}\}$, that is, when the last record of a set is to be accessed. The prior pointer in the owner record in this case points to the last member in the chain, and it is used to access the last record at a cost of $p_3(R_j)$ in expected page fault rate. As before, the owner record must first be accessed through its data key value at a cost of $p_1$. The total cost of implementation 13 for $R_j$ in $O_\ell^k$ is then:

\[
TC(13, j, O_\ell^k) = f_\ell^k\,(p_1 + p_3(R_j) + 3p_4)
\quad \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND LAST, STORE LAST}\}.
\]
The second case where the cost of implementation 13 is different from that of implementation 9 for $R_j(A, B)$ is an operation such as $O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j, f_\ell^k\rangle$. This operation requires the retrieval of the "prior" member record (with respect to a previously established "current" member record) of the set that implements $R_j$. At each execution of $O_\ell^k$, then, a single record access, using the prior pointer, is sufficient to locate the previous member record. This record access will cost $p_2(R_j)$ in expected page fault rate. The time cost of implementation 13 for $R_j(A, B)$ in $O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j, f_\ell^k\rangle$ is as follows.

\[
TC(13, j, O_\ell^k) = f_\ell^k\,(p_2(R_j) + 2p_4).
\]

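The complete time cost of implementation 13 for $R_j(A, B)$ in $O_\ell^k$ is given below.

\[
TC(13, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{m}_j - 1)\,p_2(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
f_\ell^k\,(p_2(R_j) + 2p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + \tfrac{1}{2}(\bar{m}_j - 1)\,p_2(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4 + \tfrac{\bar{m}_j - 1}{2}\,p_2(R_j)) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]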
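
The case analysis for implementation 13 can be read as a small lookup routine. The following Python sketch is illustrative only and is not part of the report: the function name `tc_impl13`, the direction flags `"BA"` and `"AB"`, and the assumption that SOP consists of exactly the twelve operation codes named in this section are all hypothetical, and the page fault rates are passed in as plain numbers.

```python
# Assumed contents of SOP: the operation codes that appear in this section.
SOP = {
    "FIND FIRST", "STORE FIRST", "FIND LAST", "STORE LAST", "FIND ALL",
    "FIND NEXT", "FIND PREV", "FIND ITH", "FIND KEY", "STORE KEY",
    "MODIFY", "ERASE",
}


def tc_impl13(op, direction, f, p1, p2, p3, p4, m_bar):
    """Time cost of implementation 13 for one operation.

    direction "BA" stands for <op, B, A, R_j, f> (members reached from owner),
    direction "AB" stands for <op, A, B, R_j, f> (owner reached from a member).
    f is the operation frequency, m_bar the average number of members per set.
    """
    if direction == "BA":
        if op == "FIND ALL":
            return f * (p1 + p3 + (m_bar - 1) * p2 + 3 * p4)
        if op in {"FIND FIRST", "STORE FIRST", "FIND LAST", "STORE LAST"}:
            # The prior pointer in the owner makes the last member as cheap as the first.
            return f * (p1 + p3 + 3 * p4)
        if op in {"FIND NEXT", "FIND PREV"}:
            return f * (p2 + 2 * p4)
        if op in {"FIND ITH", "FIND KEY", "STORE KEY", "MODIFY", "ERASE"}:
            return f * (p1 + p3 + 0.5 * (m_bar - 1) * p2 + 3 * p4)
    elif direction == "AB" and op in SOP:
        # No owner pointers: on average half of the chain is traversed.
        return f * (p1 + p3 + 3 * p4 + 0.5 * (m_bar - 1) * p2)
    return 0.0
```

For example, `tc_impl13("FIND PREV", "BA", f, p1, p2, p3, p4, m_bar)` depends only on `p2` and `p4`, which is the saving the prior pointers are meant to provide.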

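Implementation 14 for $R_j(A, B)$, as defined in 3.1.7, is called the chain with prior pointers association of A under B. Time costs of this implementation are analogous to those for implementation 13 presented above.

Following is the time cost of implementation 14 for $R_j(A, B)$ in operation $O_\ell^k$.

\[
TC(14, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{n}_j - 1)\,p_2'(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
f_\ell^k\,(p_2'(R_j) + 2p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + \tfrac{1}{2}(\bar{n}_j - 1)\,p_2'(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4 + \tfrac{\bar{n}_j - 1}{2}\,p_2'(R_j)) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]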


3.4.7  Time  Cost  of  Implementations  15  and  16 

Implementation 15 for $R_j(A, B)$, as defined in 3.1.8, is called the chain with next, prior and owner pointers association of B under A. Time costs of this implementation are different from those for implementation 9 only when owner or prior (or both) pointers are used in a series of record accesses. Owner pointers are used in operations of the form $O_\ell^k = \langle op, A, B, R_j(A, B), f_\ell^k\rangle$ with $op \in SOP$, and prior pointers are used in operations such as $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ with $op \in \{\text{FIND LAST, STORE LAST, FIND PREV.}\}$. The time cost of implementation 15 for $R_j(A, B)$ in $O_\ell^k$ is then as follows.

\[
TC(15, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{m}_j - 1)\,p_2(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
f_\ell^k\,(p_2(R_j) + 2p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + \tfrac{1}{2}(\bar{m}_j - 1)\,p_2(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]
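
As a short worked comparison (a reading of the two case lists, not a result stated in the report), the only case in which implementations 13 and 15 differ is the owner access $\langle op, A, B, R_j, f_\ell^k\rangle$ with $op \in SOP$. Subtracting the corresponding cases gives the expected saving contributed by the owner pointers:

\[
TC(13, j, O_\ell^k) - TC(15, j, O_\ell^k)
= f_\ell^k\left(p_1 + p_3(R_j) + 3p_4 + \tfrac{\bar{m}_j - 1}{2}\,p_2(R_j)\right) - f_\ell^k\left(p_1 + p_3(R_j) + 3p_4\right)
= f_\ell^k\,\tfrac{\bar{m}_j - 1}{2}\,p_2(R_j).
\]

The saving grows linearly with the average set size $\bar{m}_j$ and vanishes when each set has a single member.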

Implementation 16 for $R_j(A, B)$, as defined in 3.1.8, is the chain with next, prior and owner pointers association of A under B. This implementation is the converse of implementation 15 and therefore formulas for computing time costs for the implementation are analogous to those for implementation 15 presented above.

The time cost of implementation 16 for $R_j(A, B)$ in $O_\ell^k$ is given by the following.

\[
TC(16, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + p_3(R_j) + (\bar{n}_j - 1)\,p_2'(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
f_\ell^k\,(p_2'(R_j) + 2p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + \tfrac{1}{2}(\bar{n}_j - 1)\,p_2'(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]

3.4.8  Time  Cost  of  Implementations  17  and  18 

Implementation 17 for $R_j(A, B)$ was described in 3.1.9 and is called the pointer array association of B under A. The pointer array of each set is assumed to be stored in the page in which the owner record of the set is stored, so the retrieval of the pointer array is accomplished at no extra cost once the owner record is accessed. In the following we will first present the time cost of implementation 17 for $R_j$ in operations of the form $\langle op, A, B, R_j, f_\ell^k\rangle$ and then we will consider operations of the form $\langle op, B, A, R_j, f_\ell^k\rangle$.

Operations such as $\langle op, A, B, R_j, f_\ell^k\rangle$ require owner record accesses when implementation 17 is selected for $R_j$. As defined in 3.1.9, owner pointers are not provided in the member records of the pointer array associations and therefore owner record accesses are not possible in this type of implementation. We associate a time cost of infinity with implementation 17 in the above operations, i.e.,

\[
TC(17, j, O_\ell^k) = \infty \quad \text{if } O_\ell^k = \langle op, A, B, R_j(A, B), f_\ell^k\rangle \text{ and } op \in SOP.
\]

An operation such as $\langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle$ requires i) one record access to the owner record of the set (that contains the specified A-value) and ii) $\bar{m}_j$ accesses to member records to retrieve all B-values related to that A-value in $R_j$. The record access in i) is an access through the data key value and costs $p_1$ in expected page fault rate. The $\bar{m}_j$ record accesses in ii) are related record accesses (owner to member) at a total cost of $\bar{m}_j\,p_3(R_j)$. For making the proper privacy decisions three locks have to be accessed and checked, namely those for A, B and $R_j$.

The total cost of implementation 17 for $R_j(A, B)$ in $O_\ell^k = \langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle$ is then

\[
TC(17, j, O_\ell^k) = f_\ell^k\,(p_1 + \bar{m}_j\,p_3(R_j) + 3p_4).
\]


We now consider the operation $\langle \text{FIND KEY}, B, A, R_j, f_\ell^k\rangle$, which requires the retrieval of a record containing a specific B-value related to a given A-value in $R_j(A, B)$. The record accesses required by this operation are i) an owner record access through the data key value (the A-value), ii) an average of $\tfrac{1}{2}\bar{m}_j$ member record accesses, using the pointers in the pointer array until the member record that contains the desired B-value is hit, and iii) three accesses to retrieve privacy locks. Record accesses in i), ii) and iii) have time costs of $p_1$, $\tfrac{1}{2}\bar{m}_j\,p_3(R_j)$ and $3p_4$, in expected page fault rate, respectively.

It can similarly be shown that the same time costs result for every $op \in \{\text{FIND KEY, STORE KEY, MODIFY, ERASE}\}$, and therefore the time cost of implementation 17 for $R_j(A, B)$ in $O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle$ is as follows.

\[
TC(17, j, O_\ell^k) = f_\ell^k\,(p_1 + \tfrac{1}{2}\bar{m}_j\,p_3(R_j) + 3p_4)
\quad \text{if } op \in \{\text{FIND KEY, STORE KEY, MODIFY, ERASE}\}.
\]


The next set of operations to be considered is the set of operations such as $O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle$ with $op \in \{\text{FIND FIRST, STORE FIRST, FIND LAST, STORE LAST, FIND ITH}\}$. For example, the operation $\langle \text{FIND ITH}, B, A, R_j, f_\ell^k\rangle$, at each execution, requires the retrieval of the i-th B-value related to a specific A-value, say a, in $R_j$. The values of i and a will be provided at execution time. Each execution of the operation requires one owner record access, through the data key value, to locate the owner record that contains a. Once this record is located the array of pointers to member records is available. The i-th element of the array contains the address of the i-th member record, which in turn contains the desired B-value. So two record accesses are needed that cost $p_1$ and $p_3(R_j)$ in expected page fault rate, respectively. So we have the following.

\[
TC(17, j, O_\ell^k) = f_\ell^k\,(p_1 + p_3(R_j) + 3p_4)
\quad \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND FIRST, FIND LAST, STORE FIRST, STORE LAST}\}.
\]


The last two operations to be considered are $\langle \text{FIND NEXT}, B, A, R_j, f_\ell^k\rangle$ and $\langle \text{FIND PREV.}, B, A, R_j, f_\ell^k\rangle$. In these two cases the operation requires the retrieval of the next (or prior) member record (with respect to some previously established current member record) of the set. Member records of a set in pointer array mode, as defined in Section 3.1, do not contain any pointers to either the owner record or any other member record of the set. This makes it impossible (or at least extremely hard) to process the FIND NEXT and FIND PREV. operations, and therefore we associate a time cost of infinity with implementation 17 for $R_j$ in $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ where $op \in \{\text{FIND NEXT, FIND PREV.}\}$.


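The complete time cost of implementation 17 for $R_j(A, B)$ in operation $O_\ell^k$ is as follows.

\[
TC(17, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + \bar{m}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + \tfrac{1}{2}\bar{m}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
\infty & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\}, \text{ or } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]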
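
The infinite costs are what exclude implementation 17 from any design that needs owner access or relative navigation. A minimal Python sketch of the table above follows; the function name `tc_impl17`, the direction flags and the assumed contents of SOP are hypothetical and are used for illustration only.

```python
import math

# Assumed contents of SOP: the operation codes that appear in this section.
SOP = {
    "FIND FIRST", "STORE FIRST", "FIND LAST", "STORE LAST", "FIND ALL",
    "FIND NEXT", "FIND PREV", "FIND ITH", "FIND KEY", "STORE KEY",
    "MODIFY", "ERASE",
}


def tc_impl17(op, direction, f, p1, p3, p4, m_bar):
    """Time cost of the pointer array association (implementation 17).

    direction "BA" stands for <op, B, A, R_j, f>, "AB" for <op, A, B, R_j, f>.
    Unsupported operations are given an infinite cost, as in the text.
    """
    if direction == "AB":
        # Member records carry no owner pointer, so owner access is not possible.
        return math.inf if op in SOP else 0.0
    if direction == "BA":
        if op == "FIND ALL":
            return f * (p1 + m_bar * p3 + 3 * p4)
        if op in {"FIND KEY", "STORE KEY", "MODIFY", "ERASE"}:
            return f * (p1 + 0.5 * m_bar * p3 + 3 * p4)
        if op in {"FIND ITH", "FIND FIRST", "STORE FIRST", "FIND LAST", "STORE LAST"}:
            return f * (p1 + p3 + 3 * p4)
        if op in {"FIND NEXT", "FIND PREV"}:
            # No member-to-member pointers exist in pointer array mode.
            return math.inf
    return 0.0
```

Because `math.inf` propagates through sums, a total cost accumulated over all operations of a run unit becomes infinite as soon as one operation is unsupported.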

Implementation 18 for $R_j(A, B)$ was defined in 3.1.9 as the pointer array association of A under B. Since this implementation is the converse of implementation 17 for $R_j(A, B)$, derivations of time cost equations for it are analogous to those presented above and hence only the results are given below.

The time cost of implementation 18 for $R_j(A, B)$ in operation $O_\ell^k$ is as follows.

\[
TC(18, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + \bar{n}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + \tfrac{1}{2}\bar{n}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
\infty & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\}, \text{ or } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]


3.4.9 Time Cost of Implementations 19 and 20

Implementation 19 for $R_j(A, B)$, as defined in 3.1.10, is called the pointer array with owner pointer association of B under A. It is similar to implementation 17 except that in each set of the association member records have an extra pointer field that contains the address of the owner record. Time costs of implementation 19 are different from those of implementation 17 for $R_j(A, B)$ only in operations that require the use of these owner pointers in a series of record traversals.


In an operation such as $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ where $op \in \{\text{FIND NEXT, FIND PREV.}\}$, the next (or prior) member record of a set is to be accessed from the current member record. The operation (at each execution) requires one record access to the owner of the set at a cost of $p_3(R_j)$ and a subsequent member record access (to the prior or next record), again at a cost of $p_3(R_j)$. Accesses to privacy locks will cost $3p_4$ since locks on data items A and B and relation $R_j$ should be retrieved and checked.

The total time cost is then as follows.

\[
TC(19, j, O_\ell^k) = f_\ell^k\,(2p_3(R_j) + 3p_4)
\quad \text{if } O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\}.
\]

Operations of the form $O_\ell^k = \langle op, A, B, R_j(A, B), f_\ell^k\rangle$ where $op \in SOP$ are basically owner record accesses from a member record (accessed through its data key value) when implementation 19 is chosen for $R_j(A, B)$. Each execution of the operation requires one record access to the member record that contains a specific B-value and one record access (using the owner pointer in the member record) to the owner record of the set (that implements $R_j$). These two record accesses cost $p_1$ and $p_3(R_j)$, in expected page fault rate, respectively. So we have the following.

\[
TC(19, j, O_\ell^k) = f_\ell^k\,(p_1 + p_3(R_j) + 3p_4)
\quad \text{if } O_\ell^k = \langle op, A, B, R_j(A, B), f_\ell^k\rangle \text{ and } op \in SOP.
\]


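The complete time cost of implementation 19 for $R_j(A, B)$ in operation $O_\ell^k$ is as follows.

\[
TC(19, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + \bar{m}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + \tfrac{1}{2}\bar{m}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
f_\ell^k\,(2p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]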
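
Only two groups of operations separate implementation 19 from implementation 17: the ones that were infinitely expensive without owner pointers. The sketch below isolates those two cases. It is illustrative only; the name `tc_impl19_owner_cases`, the direction flags and the assumed contents of SOP are hypothetical.

```python
# Assumed contents of SOP: the operation codes that appear in this section.
SOP = {
    "FIND FIRST", "STORE FIRST", "FIND LAST", "STORE LAST", "FIND ALL",
    "FIND NEXT", "FIND PREV", "FIND ITH", "FIND KEY", "STORE KEY",
    "MODIFY", "ERASE",
}


def tc_impl19_owner_cases(op, direction, f, p1, p3, p4):
    """Return the cost of the cases where implementation 19 uses owner pointers.

    Returns None for every other operation, whose cost is the same as for
    implementation 17.
    """
    if direction == "BA" and op in {"FIND NEXT", "FIND PREV"}:
        # Current member -> owner (owner pointer) -> neighbouring member (pointer array).
        return f * (2 * p3 + 3 * p4)
    if direction == "AB" and op in SOP:
        # Member located by data key, then its owner reached via the owner pointer.
        return f * (p1 + p3 + 3 * p4)
    return None
```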

Implementation 20 for $R_j(A, B)$, as defined in 3.1.10, is called the pointer array with owner pointer association of A under B. The derivations of time cost equations for this implementation are similar to those for implementation 19 for $R_j$ and hence only the results are presented below.

The time cost of implementation 20 for $R_j(A, B)$ in operation $O_\ell^k$ is as follows.

\[
TC(20, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + \bar{n}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle \text{FIND ALL}, A, B, R_j, f_\ell^k\rangle \\
f_\ell^k\,(p_1 + \tfrac{1}{2}\bar{n}_j\,p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND FIRST, STORE FIRST, FIND LAST, STORE LAST}\} \\
f_\ell^k\,(2p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND NEXT, FIND PREV.}\} \\
f_\ell^k\,(p_1 + p_3(R_j) + 3p_4) & \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]

3.4.10  Time  Cost  of  Implementation  21 
Implementation 21 for $R_j(A, B)$, as defined in 3.1.11, is the dummy record association of A and B. We first consider operations such as $\langle op, B, A, R_j(A, B), f\rangle$ and derive time costs of implementation 21 for different operation codes, op, in the set of operation codes SOP. We will then present the time cost formulas for the converse operations $\langle op, A, B, R_j(A, B), f\rangle$, $op \in SOP$. We recall that in this implementation two set types are defined where A and B appear in the owner record type of the "first" and "second" set type, respectively.

An operation such as $\langle \text{FIND ALL}, B, A, R_j(A, B), f\rangle$ requires, at each execution, the following record accesses. One record access is needed to locate the record that contains a specific A-value, at a cost of $p_1$. This record is the owner record of a set of the first type with an average of $\bar{m}_j$ dummy member records that have to be accessed. The first one of these records can be accessed at a cost of $p_3(R_j)$ and the remaining $(\bar{m}_j - 1)$ at a cost of $(\bar{m}_j - 1)\,p_2(R_j)$. Finally, the owner records of the $\bar{m}_j$ data base sets (of the second type) must be accessed to obtain all B-values related to the A-value in $R_j(A, B)$. These last $\bar{m}_j$ record accesses cost $\bar{m}_j\,p_3(R_j)$ in expected page fault rate. Adding to the above $4p_4$, the cost of four privacy decisions to check the locks on the two data items A and B and the two data base sets, we get the following.

\[
TC(21, j, O_\ell^k) = f_\ell^k\,(p_1 + (\bar{m}_j + 1)\,p_3(R_j) + (\bar{m}_j - 1)\,p_2(R_j) + 4p_4)
\quad \text{if } O_\ell^k = \langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle.
\]
Next we examine the time cost of implementation 21 for $R_j(A, B)$ in an operation such as $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ where $op \in \{\text{FIND FIRST, STORE FIRST}\}$. In this case the first record access that locates the record containing a specific A-value costs $p_1$. Then only one dummy record is accessed from that record at a cost of $p_3(R_j)$. The latter record contains the address of the record in which the first B-value related to that specific A-value is stored, and it is accessed at a cost of $p_3(R_j)$. Again the four required privacy decisions cost $4p_4$ in expected page fault rate. The time cost of implementation 21 for $R_j(A, B)$ in operations of the above form is given by

\[
TC(21, j, O_\ell^k) = f_\ell^k\,(p_1 + 2p_3(R_j) + 4p_4).
\]

We can also arrive at the same formula by setting $\bar{m}_j = 1$ in the time cost formula derived for $O_\ell^k = \langle \text{FIND ALL}, B, A, R_j, f_\ell^k\rangle$.
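
The consistency check mentioned above can be carried out numerically. In the sketch below the FIND ALL cost is written with the coefficients that follow from counting the record accesses in the text (one first dummy record plus one owner access per dummy record at $p_3$, and the remaining dummy records at $p_2$); the function names and the sample values are hypothetical and are not taken from the report.

```python
def tc21_find_all(f, p1, p2, p3, p4, m_bar):
    """FIND ALL under implementation 21: owner, m_bar dummies, m_bar second-type owners."""
    return f * (p1 + (m_bar + 1) * p3 + (m_bar - 1) * p2 + 4 * p4)


def tc21_find_first(f, p1, p3, p4):
    """FIND FIRST / STORE FIRST under implementation 21."""
    return f * (p1 + 2 * p3 + 4 * p4)


# With a single member per set the two formulas coincide.
sample = dict(f=2.0, p1=1.0, p3=0.25, p4=0.125)
same = tc21_find_all(p2=0.5, m_bar=1, **sample) == tc21_find_first(**sample)
```

With these sample values `same` evaluates to `True`, since the $p_2$ term drops out and the $p_3$ coefficient becomes 2.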

The derivation of time costs of implementation 21 for $R_j(A, B)$ in operations such as $O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle$ where $op \in \{\text{FIND LAST, STORE LAST}\}$ can be simplified by observing that this case is similar to the case where op = FIND ALL discussed above, with the difference that instead of $\bar{m}_j$ owner record accesses to data base sets of the second type only one owner record access is necessary, and hence we have the following.

\[
TC(21, j, O_\ell^k) = f_\ell^k\,(p_1 + 2p_3(R_j) + (\bar{m}_j - 1)\,p_2(R_j) + 4p_4)
\quad \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND LAST, STORE LAST}\}.
\]

An operation such as $O_\ell^k = \langle \text{FIND ITH}, B, A, R_j, f_\ell^k\rangle$ requires, at each execution, the following record accesses under implementation 21 for $R_j(A, B)$. As in the previous case, a record containing a specific A-value must be accessed at a time cost of $p_1$. This record is the owner of a data base set of the first type; accessing the first member record of this set costs $p_3(R_j)$. Since the required i-th B-value related to the A-value retrieved above (in $R_j$) is on the average halfway down the list of such B-values, $\tfrac{1}{2}(\bar{m}_j - 1)$ of the dummy records must be accessed at a cost of $\tfrac{1}{2}(\bar{m}_j - 1)\,p_2(R_j)$, and then the owner of the final dummy record in its role as a member of a set of the second type must be accessed at a cost of $p_3(R_j)$. Adding to the above $4p_4$ for the cost of making the necessary privacy decisions gives the following formula for the time cost of implementation 21 for $R_j(A, B)$ in operations such as $O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle$, op = FIND ITH. With similar arguments we can show the formula to be valid for all $op \in \{\text{FIND ITH, FIND KEY, STORE KEY, MODIFY, ERASE}\}$.

\[
TC(21, j, O_\ell^k) = f_\ell^k\,(p_1 + 2p_3(R_j) + \tfrac{1}{2}(\bar{m}_j - 1)\,p_2(R_j) + 4p_4)
\quad \text{if } O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle \text{ and } op \in \{\text{FIND ITH, FIND KEY, STORE KEY, MODIFY, ERASE}\}.
\]
We next consider the time cost of implementation 21 for $R_j(A, B)$ in an operation such as $O_\ell^k = \langle \text{FIND NEXT}, B, A, R_j, f_\ell^k\rangle$. At each execution of this operation the next B-value, b, related to a specific A-value, a, in $R_j$ is to be retrieved. Again, as explained in the derivations of time costs for implementation 9, it is assumed that in a previous operation a current B-value, say b', related to a in $R_j$ was accessed and now the next B-value, b, related to a is to be retrieved. The dummy record associated with the pair $\langle a, b'\rangle \in R_j$ is available at the time of the execution of the operation. As a member record in a set of the first type, this dummy record contains the address of the dummy record associated with $\langle a, b\rangle \in R_j$. Accessing the latter record costs $p_2(R_j)$ in expected page fault rate. The dummy record associated with the pair $\langle a, b\rangle \in R_j$ contains the address of the record in which b is stored. Accessing this record costs $p_3(R_j)$. Three locks must be accessed, one for data item B and the other two for the two data base sets, at a total cost of $3p_4$. The total cost of implementation 21 for $R_j(A, B)$ in $O_\ell^k = \langle \text{FIND NEXT}, B, A, R_j, f_\ell^k\rangle$ is then given by:

\[
TC(21, j, O_\ell^k) = f_\ell^k\,(p_2(R_j) + p_3(R_j) + 3p_4).
\]

The last operation code to be considered is op = FIND PREV. In the case of an operation such as $O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j(A, B), f_\ell^k\rangle$, the prior B-value, b, with respect to the current B-value, b' (established in a previous operation), both related to a specific A-value, a, in $R_j$, must be retrieved. The dummy record associated with $\langle a, b\rangle \in R_j$ is the prior record (with respect to the record associated with $\langle a, b'\rangle \in R_j$) in the chain of dummy member records of a data base set of the first type whose owner record contains a.

In order to retrieve the desired B-value, the dummy record associated with $\langle a, b\rangle$ must be retrieved. This requires 1) the retrieval of the record that contains a, using the proper owner pointer in the dummy record associated with the pair $\langle a, b'\rangle$; 2) accessing the first member record in the set owned by that record, using its member pointer; 3) retrieving all dummy records (in the chain) that precede the one associated with the pair $\langle a, b'\rangle$; and finally 4) retrieving the owner record in a set of the second type (of which the dummy record associated with $\langle a, b\rangle$ is a member), using the proper owner record pointer. The above record traversals cost $p_3(R_j)$, $p_3(R_j)$, $\tfrac{1}{2}(\bar{m}_j - 1)\,p_2(R_j)$ and $p_3(R_j)$, respectively. Adding to the above the time cost of the four privacy decisions, $4p_4$, gives the following formula for the time cost of implementation 21 for $R_j(A, B)$ in $O_\ell^k = \langle \text{FIND PREV.}, B, A, R_j, f_\ell^k\rangle$.

\[
TC(21, j, O_\ell^k) = f_\ell^k\,(3p_3(R_j) + \tfrac{1}{2}(\bar{m}_j - 1)\,p_2(R_j) + 4p_4).
\]

Following is a summary of the formulas derived above for the time cost of implementation 21 for $R_j(A, B)$ in the $\ell$-th operation of the k-th run unit, $O_\ell^k$.

\[
TC(21, j, O_\ell^k) =
\begin{cases}
f_\ell^k\,(p_1 + 2p_3(R_j) + 4p_4) & \text{if } op \in \{\text{FIND FIRST, STORE FIRST}\} \\
f_\ell^k\,(p_1 + 2p_3(R_j) + (m - 1)\,p_2(R_j) + 4p_4) & \text{if } op \in \{\text{FIND LAST, STORE LAST}\} \\
f_\ell^k\,(p_1 + 2p_3(R_j) + \tfrac{1}{2}(m - 1)\,p_2(R_j) + 4p_4) & \text{if } op \in \{\text{FIND ITH, FIND KEY, STORE KEY, MODIFY, ERASE}\} \\
f_\ell^k\,(p_1 + (m + 1)\,p_3(R_j) + (m - 1)\,p_2(R_j) + 4p_4) & \text{if } op = \text{FIND ALL} \\
f_\ell^k\,(p_2(R_j) + p_3(R_j) + 3p_4) & \text{if } op = \text{FIND NEXT} \\
f_\ell^k\,(3p_3(R_j) + \tfrac{1}{2}(m - 1)\,p_2(R_j) + 4p_4) & \text{if } op = \text{FIND PREV.} \\
0 & \text{otherwise,}
\end{cases}
\]

where in the above $m = \bar{m}_j$ if $O_\ell^k = \langle op, B, A, R_j, f_\ell^k\rangle$ and $m = \bar{n}_j$ if $O_\ell^k = \langle op, A, B, R_j, f_\ell^k\rangle$.

Note that in the case of implementation 21 for $R_j$, $p_2(R_j)$ and $p_3(R_j)$ are computed using a different formula, compared to the formulas for all other implementations, and that $p_2(R_j)$ and $p_2'(R_j)$ are equal in this case (see 3.4.1).

3.4.11 Time Costs of Implementations 22, 23 and 24

Implementation 22 for $R_j(A, B)$, as defined in 3.1.12, is the single linkage of B under A. The structure of the implementation is very similar to that of implementation 17. The major difference between the two implementations for $R_j(A, B)$ is that in 17 a data base set should be defined to establish relationships in $R_j$ and therefore only one-to-many relations can be implemented by 17. However, in the case of a linkage implementation this limitation is not necessary and in fact implementation 22 may be used to implement a many-to-many relationship. This difference, however, does not make the time cost of implementation 22 different from that of implementation 17, and hence we have

\[
TC(22, j, O_\ell^k) = TC(17, j, O_\ell^k).
\]

And similarly the following is true.

\[
TC(23, j, O_\ell^k) = TC(18, j, O_\ell^k)
\]

\[
TC(24, j, O_\ell^k) =
\begin{cases}
TC(17, j, O_\ell^k) & \text{if } O_\ell^k = \langle op, B, A, R_j(A, B), f_\ell^k\rangle \text{ and } op \in SOP \\
TC(18, j, O_\ell^k) & \text{if } O_\ell^k = \langle op, A, B, R_j(A, B), f_\ell^k\rangle \text{ and } op \in SOP \\
0 & \text{otherwise.}
\end{cases}
\]

The set of implementations defined and analyzed for storage and time costs in this chapter is not considered to be exhaustive of all conceivable ways of implementing a relation. Especially in association implementations several other modes of implementation are possible, which include simple or multi-level record arrays, Boolean array, packed Boolean array and binary tree. In a record array association, member records of a set are stored in physically sequential locations of the address space, whether or not each member is in turn the owner of some other set. In a multi-level record array association (Figure 3.18(a)), the owner record of a set is followed by its members. Each member in turn, if it is an owner of a set, will be followed by its own members, etc.

In the binary tree form (Figure 3.18(b)) the owner of each set points to a member and each member points to two other members or potentially to two subtrees of member records. There are several variations of this scheme. In the Boolean array mode the owner of each set contains (or has a pointer to) an array of bits, the size of which is at least equal to the count of records in the file of member records. Each bit position in the array corresponds to a particular record. A one in a bit position of the bit array of the owner record of a set specifies that the corresponding record is a member of the set. Again other variations of this scheme are possible, with the packed Boolean array as an example.

Most of the above mentioned set implementations can be analyzed and incorporated into the relation implementation models of Section 3.1. However, increasing the number of implementation alternatives increases the computation time of the optimization program very rapidly. We have restricted the model to the previously defined 24 relation implementations because it covers a sufficiently large class of implementations used in data base systems.
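
To give a feel for this growth, the sketch below counts the configurations a naive exhaustive search would face. It rests on assumptions that are not taken from the report: every relation is assumed to admit every alternative, and the search is assumed to enumerate all combinations, whereas the optimization algorithm of Chapter IV prunes the candidates through acceptability constraints and "undominated" choices. The function name `configuration_count` is hypothetical.

```python
def configuration_count(n_relations, n_alternatives=24):
    """Upper bound on configurations: one independent choice per relation."""
    return n_alternatives ** n_relations


# Ratio by which the bound grows when one more alternative is offered
# for each of 10 relations (25 alternatives instead of 24).
growth = configuration_count(10, 25) / configuration_count(10, 24)
```

Even a single additional alternative per relation multiplies the bound by $(25/24)^{10}$, roughly one and a half, and the bound itself is exponential in the number of relations.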

Again on the subject of association implementations, it is possible that for sets with a very large number of members, some sort of indexing would be desirable in certain types of user queries. In fact, any of the 13 association implementations (numbers 9 to 21) can be indexed or not. Also taking into account the fact that several forms of indexing may be made available, the set of possible relation implementations would grow substantially in size, which in turn can make the optimization program intractable.

The representation of individual data items is considered to be fixed (as specified by user input data) as opposed to being in control of the schema designer. Various transformations could be applied to data item representations which would affect storage/time costs of individual relation implementations and thereby affect the outcome of the optimization problem. These transformations have different effects depending upon the purpose they serve, which may be compaction/decompaction, encoding/decoding, etc. The privacy transformations in most practical cases require some extra space expenditure in addition to the time expenditure of actually performing the transformations. Their value, however, is in the data protection they provide and depends on some external parameters that are not modelled here.

On the other hand, the compaction/decompaction types of transformations are supposed to save storage space with some processing time overhead. In order to incorporate this into the model, some (maybe one) alternative transformation options must be specified and analyzed so that costs and values could be determined. Storage savings can easily be absorbed by the previously mentioned storage cost functions; however, the incorporation of the processing times into our time cost functions would be a tedious and inexact job.




Set  types  that  we  consider  in  this  work  may  have  only 
one  member  record  type.  This  is  not  a strong  restriction, 
because  the  same  record  type  can  be  declared  as  owner  in 
many  different  set  types. 

Furthermore, we have made the assumption that, within a set, sequential ordering of member records is determined by a key which is one of the data items within the record. The DBTG specifications allow the concatenation of several data item types to be specified as a sort key. Inclusion of this feature in our model would necessitate the knowledge of record structures beforehand, which is a part of the design goal.

There are, of course, other factors that may have some effects on the cost function formulations of the preceding sections. These factors include overflow mechanisms and strategies for reclaiming storage spaces made available by deletions. Simultaneous consideration of these features is beyond the scope of this research.


CHAPTER  IV 


DATA  BASE  DESIGN 

In this chapter we discuss our data base design methodology. In Section 4.1 we use models of Chapters 2 and 3 to set up an optimization problem. This is done by first identifying, for each relation in the data base, a set of acceptable implementations through the application of constraint conditions developed in 3.2. Then storage and time costs for each of the acceptable implementations of a relation are determined using storage and time cost functions of Sections 3.3 and 3.4, respectively. Next in Section 4.2 we will develop the first phase of the optimization algorithm in which sequences of "undominated choices" are generated using the information about acceptability, storage costs and time costs of relation implementations. In Section 4.3 we describe the second phase of the optimization algorithm in which a discrete multi-stage decision process is formulated and solved through a step-wise generation of "undominated solutions." Solutions to the final stage of the decision process determine a series of optimal configurations from which we derive our optimal data base designs in Section 4.4.

4.1  Problem  Set  Up 

The data base designer provides an application in the following
form. A set SI of data item types, containing NI elements, is
provided. For each data item type, its name, size and cardinality
must be given. Similarly, a set SR of data base relations,
containing NR relations, is provided by the designer. For each
relation, its name, cardinality, origin data item type,
destination data item type and two multiplicity vectors
$\langle m, \bar{m}, M \rangle$ and $\langle n, \bar{n}, N \rangle$
are given. Next, a set SU of run units of cardinality NU is given.
For each run unit, a name and the number of operations in the run
unit must be specified, along with the operations themselves. Each
operation in turn must have an operation code, an operand data
item type, a qualifier data item type, a relation and a frequency
of execution. The relation specified in an operation must be a
member of the set SR, and it must be defined over the two data
items specified as operand and qualifier (in either order) in the
operation. The last set of variables that the designer has to
specify determines the values of the storage space parameters:
MSPG, PGSZ, PTRS, CNTRS, and PLOKS.

The  above  information,  provided  by  the  designer  in 
accordance  with  the  models  developed  in  Chapter  2,  is  used 
as  input  to  the  data  base  design  procedure.  For  detailed 
definition  and  example  values  of  the  above  parameters, 
refer  to  Chapter  2. 

The data base design procedure starts with the task of
determining the type of each relation. In Chapter 2 we defined
eight types of relations (numbered 0 through 7) and, depending on
its multiplicity vectors, each relation is of one of the eight
types. Knowledge of the type of each relation simplifies the
application of constraint conditions in order to determine the
acceptability of implementations for the relation.

The next order of business is to find all acceptable
implementation alternatives that can be used correctly to
implement each individual relation in SR, from among the 24
possible implementation alternatives defined in Section 3.1. We
developed, in Section 3.2, a set of constraint conditions for each
implementation alternative that, when satisfied by the parameters
specifying a relation, ensure the correctness of the choice of
the implementation for that relation.

As discussed in Section 3.2, since the acceptability of some
relation implementations depends on the acceptability of others,
a certain order must be maintained in the application of
constraint conditions. In particular, the acceptability of
implementations 5, 6, 7 and 8 must be determined before all other
implementations except implementations 1 and 2, because the
constraint conditions for those implementations for a relation
depend on whether or not implementations 5, 6, 7 and 8 are
acceptable for the relation.

The design procedure, therefore, first applies the constraint
conditions of implementations 1 and 2 for all relations in SR,
then proceeds with application of the constraint conditions
pertaining to implementations 5, 6, 7 and 8, and finally applies
the constraint conditions of the remaining implementations
(3, 4, 9, 10, ..., 24).

It is evident that the inter-dependence of the constraint
conditions may result in the exclusion of some otherwise
acceptable configurations, but it is essential for the proper
execution of our optimization algorithm that the choice of an
implementation for one relation be independent of choices made
for other relations. Constraint conditions stated in Section 3.2
are, therefore, only sufficient conditions and they may reduce
the size of the solution space in which we are looking for an
optimal solution. In that sense, the solution space of our data
base design method, although sufficiently large, is smaller than
the class of all possible DBTG data base designs.

The information about acceptability of implementations for each
relation is recorded by a set of variables AC(MPL, j), where
MPL = 1, 2, ..., 24 designates an implementation number and
j = 1, 2, ..., NR designates a relation number. If implementation
number MPL is acceptable for relation $R_j$, then AC(MPL, j) = 1;
otherwise, AC(MPL, j) = 0. The set of acceptable implementations
for relation $R_j \in SR$ is denoted by IMP(j) and defined as
follows:

$$\mathrm{IMP}(j) = \{\, \mathrm{MPL} \mid \mathrm{MPL} \in \{1, 2, \ldots, 24\} \text{ and } \mathrm{AC}(\mathrm{MPL}, j) = 1 \,\}, \qquad j = 1, 2, \ldots, NR$$

Once IMP(j) is known for each $R_j \in SR$, we can compute the
storage costs of relation implementations. Storage costs are
computed using the storage cost functions SC(MPL, j),
MPL $\in$ {1, 2, ..., 24}, j = 1, 2, ..., NR, developed in
Section 3.3. Out of the maximum storage bound of TMSB characters,
only OMSB characters are available for optimization, where

$$\mathrm{OMSB} = \mathrm{TMSB} - \mathrm{MMSB} \quad\text{and}\quad \mathrm{MMSB} = \sum_{i=1}^{NI} \mathrm{ITSZ}(i) \cdot \mathrm{ITCR}(i),$$

where, for i = 1, 2, ..., NI, $\mathrm{ITCR}(i) = |I_i|$ is the
cardinality and ITSZ(i) is the size of data item type $I_i$.
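The storage bookkeeping above can be illustrated with a minimal Python sketch. The function name and the sample sizes and cardinalities are hypothetical and are not taken from the report; only the relations MMSB = sum of ITSZ(i) * ITCR(i) and OMSB = TMSB - MMSB come from the text.

```python
def optimizable_storage(tmsb, item_sizes, item_cards):
    """Return (MMSB, OMSB) for a maximum storage bound TMSB.

    item_sizes[i] plays the role of ITSZ(i) and item_cards[i] the role
    of ITCR(i); both lists are illustrative inputs, not report data.
    """
    # MMSB: storage taken by the data item occurrences themselves.
    mmsb = sum(size * card for size, card in zip(item_sizes, item_cards))
    # OMSB: what is left over for the optimizer to distribute.
    return mmsb, tmsb - mmsb


# Hypothetical application with three data item types.
mmsb, omsb = optimizable_storage(1_000_000, [4, 20, 8], [1000, 500, 2500])
print(mmsb, omsb)
```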

The next step of the data base design procedure concerns the
computation of time costs. Time cost functions for each relation
implementation <MPL, R> in all possible kinds of operations were
developed in Section 3.4. Operations were defined in Chapter 2 in
the following form:

$O = \langle op, A, B, R, f \rangle$, where $op \in SOP$ is an
operation code in the set of all possible operation codes
SOP = {FIND FIRST, FIND LAST, FIND ITH, FIND NEXT, FIND PREV.,
FIND ALL, FIND KEY, STORE FIRST, STORE LAST, STORE KEY, MODIFY,
ERASE}, $R \in SR$ is a relation on $A \in SI$ and $B \in SI$, and
f is the operation's frequency. Depending on whether A = ORG(R)
and B = DST(R), or B = ORG(R) and A = DST(R), implementation MPL
for relation R has a different time cost function in operation O
given above.

Since the operation code op in O can be any one of the 12
operation codes in SOP, there is a total of 24 implementation
alternatives, and a relation can be used in either of the two
directions just described, it appears that there is a large
number, namely 576 (12 x 24 x 2), of time cost functions to deal
with. This is not so, because a large number of these functions
are similar in form and differ only in their parameters. For
example, for $O = \langle \text{FIND ALL}, B, A, R_j, f \rangle$
and MPL = 15 the time cost function is

$$TC(\mathrm{MPL}, j, O) = f\,\bigl(p_1 + P_3(R_j) + (m_j - 1)\,P_2(R_j) + 3p_4\bigr)$$

and for the same operation O and MPL = 16 the time cost function
is given by

$$TC(\mathrm{MPL}, j, O) = f\,\bigl(p_1 + P_3(R_j) + (n_j - 1)\,P_2(R_j) + 3p_4\bigr).$$

When all such similarities in time cost functions are considered,
there are only 17 distinctly different functions that are used in
time cost computations.

At this stage in the data base design procedure, we have, for
each relation $R_j \in SR$, j = 1, ..., NR, the following
information:

1)  A set of 24 variables AC(MPL, j), MPL = 1, 2, ..., 24, where
    AC(MPL, j) = 1 if implementation MPL is acceptable for $R_j$,
    and AC(MPL, j) = 0 otherwise.

2)  A set IMP(j) of implementation numbers MPL such that
    AC(MPL, j) = 1 iff MPL $\in$ IMP(j).

3)  A storage cost SC(MPL, j) associated with each implementation
    MPL $\in$ IMP(j) for relation $R_j$.

4)  A time cost TC(MPL, j) associated with each implementation
    MPL $\in$ IMP(j) for relation $R_j$.

In Chapter 3 we defined a configuration C for SR to be a set of
ordered pairs of relation implementations where there is one and
only one element $\langle R_j, \mathrm{MPL}_j \rangle \in C$ for
each and every element $R_j \in SR$. In set theoretic terminology,
a configuration C for a set of data base relations SR is a
function from SR to the set of integers {1, 2, ..., 24}. A
configuration C for SR is called an acceptable configuration (for
SR) if for all $\langle R_j, \mathrm{MPL}_j \rangle \in C$,
$\mathrm{AC}(\mathrm{MPL}_j, j) = 1$. For such configurations, we
also defined storage and time costs, respectively, as follows:

$$CS(C) = \sum_{j=1}^{NR} SC(\mathrm{MPL}_j, j)$$

$$CT(C) = \sum_{j=1}^{NR} TC(\mathrm{MPL}_j, j)$$

(Note: NR = |SR| = |C|.)

In the above two expressions, $SC(\mathrm{MPL}_j, j)$ and
$TC(\mathrm{MPL}_j, j)$ are the storage and time costs of relation
implementation $\langle R_j, \mathrm{MPL}_j \rangle \in C$,
respectively.

The problem is, now, to find an acceptable configuration $C^*$
for SR for some given storage limit $L \le \mathrm{TMSB}$, where
$C^*$ is defined as follows:

(4.1.1)  $CS(C^*) \le L - \mathrm{MMSB}$

(4.1.2)  $CT(C^*) \le CT(C)$ for all acceptable configurations C
         for which $CS(C) \le L - \mathrm{MMSB}$.

If more than one configuration with different storage costs but
equal time costs satisfy (4.1.1) and (4.1.2), then $C^*$ is, by
definition, the configuration with the lowest storage cost. If,
however, more than one configuration with equal storage and time
costs satisfy (4.1.1) and (4.1.2), then the choice of the optimal
configuration among those configurations is arbitrary.

We call $C^*$ an "optimal" configuration. The optimality of $C^*$
is, however, within a sub-class of DBTG structures because of the
interdependence of constraint conditions we discussed earlier in
this section. It is also qualified by the observations we will
make in Chapter 5 regarding errors in time cost computations.

The existence of an optimal configuration depends on two factors.
The first factor is the existence of at least one acceptable
implementation for each relation in SR, and it is determined by
the constraint conditions only. In terms of the above variables
we must have:

$$|\mathrm{IMP}(j)| \ge 1 \quad \text{for all } R_j \in SR \qquad (4.1.3)$$

The second factor is the value of the storage limit L. It can be
shown that even when (4.1.3) is true, there exists a limit $L_0$
such that for any $L < L_0$ no optimal configuration exists. The
fact that there exists a minimum $L_0$ on L is obvious because the
set of acceptable configurations is a finite set and the
cumulative storage cost CS(C) is a real function defined for the
elements of that set. The value of $L_0$ can be derived as
follows. Let $\widetilde{\mathrm{MPL}}_j$, j = 1, 2, ..., NR, be
the implementation with the smallest storage cost for $R_j$, i.e.

$$SC(\widetilde{\mathrm{MPL}}_j, j) = \min_{\mathrm{MPL}_j \in \mathrm{IMP}(j)} SC(\mathrm{MPL}_j, j), \qquad j = 1, 2, \ldots, NR \qquad (4.1.4)$$

The existence of $\widetilde{\mathrm{MPL}}_j$, j = 1, 2, ..., NR,
is guaranteed by (4.1.3). Then $L_0$ is given by the following:

$$L_0 = \mathrm{MMSB} + \sum_{j=1}^{NR} SC(\widetilde{\mathrm{MPL}}_j, j)$$
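As a small illustration of (4.1.3) and (4.1.4), the following Python sketch computes the smallest feasible storage limit. The data layout (one dictionary of storage costs per relation) and all numbers are assumptions made for the example only.

```python
def min_storage_limit(mmsb, sc, imp):
    """Return L0 = MMSB + sum over j of the cheapest acceptable SC.

    sc[j] maps an implementation number to its storage cost for
    relation j; imp[j] is the set IMP(j). Returns None when some
    relation has no acceptable implementation, i.e. (4.1.3) fails.
    """
    if any(len(acceptable) == 0 for acceptable in imp):
        return None
    # (4.1.4): pick the least storage cost among acceptable choices.
    return mmsb + sum(min(sc[j][m] for m in imp[j]) for j in range(len(imp)))


# Hypothetical two-relation example.
sc = [{1: 50, 2: 30}, {1: 70, 5: 90}]
imp = [{1, 2}, {1, 5}]
print(min_storage_limit(100, sc, imp))
```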

If (4.1.3) holds and $L \ge L_0$, however, an optimal
configuration $C^*$ (i.e. an acceptable configuration that
satisfies (4.1.1) and (4.1.2)) exists because the set of
acceptable configurations is finite. $C^*$ can be determined by
the following method:

1)  Generate all acceptable configurations for SR. There is at
    least one such configuration because (4.1.3) is assumed to be
    true. There are $\prod_{j=1}^{NR} |\mathrm{IMP}(j)|$ such
    configurations.

2)  Compute CS(C) for all configurations C generated in 1).

3)  Retain all configurations C for which
    $CS(C) \le L - \mathrm{MMSB}$ and discard all those for which
    $CS(C) > L - \mathrm{MMSB}$. The retained set of configurations
    is not an empty set because we assumed
    $\mathrm{TMSB} \ge L \ge L_0$, and every configuration in the
    set satisfies (4.1.1).

4)  Compute CT(C) for all C in the set of configurations retained
    in 3).

5)  $C^*$ is a configuration in the retained set that has minimum
    time cost (thereby (4.1.2) is satisfied).
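Steps 1) through 5) can be sketched directly in Python. This is only an illustration under assumed inputs: each relation is given as a list of hypothetical (implementation number, storage cost, time cost) triples for its acceptable implementations, and `limit` stands for L - MMSB. Ties in time cost are broken by lower storage cost, as the definition of the optimal configuration requires.

```python
from itertools import product


def enumerate_optimal(choices, limit):
    """Exhaustive search over all acceptable configurations.

    choices[j] is a list of (mpl, storage, time) triples for relation j.
    Returns ((time, storage), [mpl per relation]) or None if no
    configuration fits within the limit.
    """
    best = None
    for config in product(*choices):            # step 1: generate
        cs = sum(c[1] for c in config)          # step 2: storage cost
        if cs > limit:                          # step 3: discard
            continue
        ct = sum(c[2] for c in config)          # step 4: time cost
        key = (ct, cs)                          # step 5: least time, then storage
        if best is None or key < best[0]:
            best = (key, [c[0] for c in config])
    return best


# Hypothetical two-relation example.
choices = [[(1, 10, 5.0), (2, 4, 9.0)], [(1, 8, 3.0), (3, 2, 8.0)]]
print(enumerate_optimal(choices, 14))
```

The number of iterations of the loop is the product of the list lengths, which is what makes the method impractical for realistic inputs.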

The optimal configuration, however, is not necessarily unique
because it is possible for two configurations $C_1$ and $C_2$ to
have identical storage costs and identical time costs, i.e.
$CS(C_1) = CS(C_2)$ and $CT(C_1) = CT(C_2)$. In our implementation
of the optimization algorithm, the choice of the optimal
configuration when two or more configurations satisfy the
conditions of optimality is arbitrary.

The method described above for finding $C^*$ is not practical
because of the large number of configurations that must be
generated in step 1). In our application example of Chapter 2, we
have 26 relations in SR, and the cardinality of the set IMP(j) for
j = 1, 2, ..., 26 is tabulated in Table 4.1.1.

     j    |IMP(j)|        j    |IMP(j)|
    ---   --------       ---   --------
     1       13           14      13
     2       13           15      12
     3        9           16      11
     4       13           17      12
     5       13           18      13
     6       10           19      11
     7        4           20      12
     8        4           21      12
     9        4           22       9
    10        9           23       5
    11        5           24      13
    12        9           25      13
    13       12           26       9

Table 4.1.1  Number of acceptable implementations, |IMP(j)|, for
             each relation j

The number of acceptable configurations to be generated in step
1) of the above method is $\prod_{j=1}^{NR} |\mathrm{IMP}(j)|$.
For our example application, this number is over
$2.32 \times 10^{25}$. Even if we could examine $10^9$
configurations per second, it would take more than
$7 \times 10^8$ years to examine that many configurations.
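The figures quoted above can be checked from the entries of Table 4.1.1. The short Python sketch below recomputes the product and the running time; the variable names are ours, and a year is taken as 365.25 days for the estimate.

```python
import math

# |IMP(j)| for j = 1, ..., 26, read off Table 4.1.1.
imp_sizes = [13, 13, 9, 13, 13, 10, 4, 4, 4, 9, 5, 9, 12,
             13, 12, 11, 12, 13, 11, 12, 12, 9, 5, 13, 13, 9]

n_configs = math.prod(imp_sizes)      # number of acceptable configurations
total = sum(imp_sizes)                # sum of |IMP(j)| over all relations

seconds = n_configs / 1e9             # at 10**9 configurations per second
years = seconds / (365.25 * 24 * 3600)

print(f"{n_configs:.3e} configurations, about {years:.2e} years")
```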

In order to get around this problem one can make use of dynamic
programming concepts in the manner discussed below. We first
define a partial configuration PC(SSR) for the set of relations
$SSR \subseteq SR$ as the following function:

$$PC(SSR) : SSR \to \{1, 2, \ldots, 24\}$$

such that if $\langle R_i, \mathrm{MPL}_i \rangle \in PC(SSR)$
then $\mathrm{AC}(\mathrm{MPL}_i, i) = 1$.

For each subset SSR of SR consisting of relations
$R_{i_1}, R_{i_2}, \ldots, R_{i_{|SSR|}}$, PC(SSR) determines a
partial assignment of acceptable implementations for relations in
SSR. Note that PC(SR) is an acceptable configuration for SR as
defined earlier in this section. We can similarly define
cumulative storage and time costs for PC(SSR),
$SSR \subseteq SR$, as follows:

If $SSR = \{R_{i_1}, R_{i_2}, \ldots, R_{i_j}\}$ where j = |SSR|
and $PC(SSR) = \{\langle R_{i_k}, \mathrm{MPL}_{i_k} \rangle \mid k = 1, 2, \ldots, j\}$,
then

$$CS(PC(SSR)) = \sum_{k=1}^{j} SC(\mathrm{MPL}_{i_k}, i_k) \quad\text{and}\quad CT(PC(SSR)) = \sum_{k=1}^{j} TC(\mathrm{MPL}_{i_k}, i_k),$$

where for k = 1, 2, ..., j, $SC(\mathrm{MPL}_{i_k}, i_k)$ and
$TC(\mathrm{MPL}_{i_k}, i_k)$ are the storage and time costs of
implementation $\mathrm{MPL}_{i_k}$ for $R_{i_k}$, respectively.

In the special case where $SSR = \{R_k \mid k = 1, 2, \ldots, j\}$
is the set of the first j relations in SR, namely if $i_k = k$ for
k = 1, 2, ..., j, we will refer to PC(SSR) simply as PC(j).

Let $SPC_j(S)$ be the set of all PC(j) such that
$CS(PC(j)) \le S$. The optimal (or objective) value (partial)
function $OVF_j(S)$ is defined by (4.1.5) below if $SPC_j(S)$ is
non-empty. It is undefined otherwise.

$$OVF_j(S) = \min_{PC(j) \in SPC_j(S)} CT(PC(j)) \qquad (4.1.5)$$

For each pair $\langle j, S \rangle$, $OVF_j(S)$ is called the
optimal value function, or the optimal j-stage solution to state
S. The partial configuration $PC^*(j) \in SPC_j(S)$ that satisfies
(4.1.5) determines the partial assignment of implementations to
relations $R_1, R_2, \ldots, R_j$. The effect of this partial
policy (j-stage solution) on later decisions is summarized in S.
In other words, if a total of S bytes of storage space is required
for the first j relation implementations, then (L - MMSB - S)
bytes remain for the implementation of
$R_{j+1}, R_{j+2}, \ldots, R_{NR}$. This is the entire information
needed from the first j-stage decision to make later decisions,
and since S summarizes the entire past history in the decision
process, we call it the state variable.


In the case of our problem, Bellman's principle of optimality
[Bellman 1957] can be stated as follows:

If an optimal configuration
$C^* = \{\langle R_1, \mathrm{MPL}_1^* \rangle, \langle R_2, \mathrm{MPL}_2^* \rangle, \ldots, \langle R_{NR}, \mathrm{MPL}_{NR}^* \rangle\}$
for SR has a partial storage cost of $S_1 = CS(PC^*(NR - 1))$ for
the partial configuration
$PC^*(NR - 1) = C^* - \{\langle R_{NR}, \mathrm{MPL}_{NR}^* \rangle\}$,
then $PC^*(NR - 1)$ must be an optimal configuration for
$SR - \{R_{NR}\}$ subject to a storage cost limit of $S_1$. Or,
stated differently, suppose an optimal policy for all NR stages
(stages are relations in this case) has a storage cost of $S_1$
bytes for the first (NR - 1) stages. Then it must have made the
optimal selection of implementations for relations
$R_1, R_2, \ldots, R_{NR-1}$ subject to a storage constraint of
$S_1$. This principle can be seen easily by the following formal
reasoning.

Suppose $C^*$ is an optimal configuration that selects
implementation $\mathrm{MPL}_{NR}^*$ for $R_{NR}$
($\langle R_{NR}, \mathrm{MPL}_{NR}^* \rangle \in C^*$). Suppose
further that the partial configuration
$PC^*(NR - 1) = C^* - \{\langle R_{NR}, \mathrm{MPL}_{NR}^* \rangle\}$
is not optimal. Then an optimal partial configuration
$PC^+(NR - 1) = \{\langle R_1, \mathrm{MPL}_1^+ \rangle, \langle R_2, \mathrm{MPL}_2^+ \rangle, \ldots, \langle R_{NR-1}, \mathrm{MPL}_{NR-1}^+ \rangle\}$
must exist, within the same storage cost limit, whose time cost
is less than the time cost of $PC^*(NR - 1)$. The configuration
$C^+ = PC^+(NR - 1) \cup \{\langle R_{NR}, \mathrm{MPL}_{NR}^* \rangle\}$
should then have a time cost that is less than $CT(C^*)$. This
contradicts the assumption that $C^*$ was an optimal
configuration.

The above principle leads to the following algorithm:

Dynamic Programming Algorithm:

    /* Initialization */
    For S = 1 until OMSB do;
        OVF_1(S) = min { TC(MPL_1, 1) :
                         MPL_1 e IMP(1) & SC(MPL_1, 1) <= S };
    end;

    /* Iteration */
    For j = 2 until NR do;
        For S = 1 until OMSB do;
            OVF_j(S) = min { OVF_(j-1)(S - SC(MPL_j, j)) + TC(MPL_j, j) :
                             MPL_j e IMP(j) };
        end;
    end;


The final goal is to determine the optimal value function for
j = NR, namely $OVF_{NR}(S)$. $OVF_{NR}(S)$ determines the value
of the time cost of the optimal configuration, from which
individual relation implementations (in the configuration) may be
traced back (from NR to 1). There is clearly another step that
must be added to the iteration part of the algorithm to keep track
of the optimal implementation selected at each S and each j so
that it may be output as the solution. Note further that for a
storage limit L, $L_0 \le L \le \mathrm{TMSB}$, the optimal value
function $OVF_{NR}(L - \mathrm{MMSB})$ does not necessarily give
the time cost of the optimal configuration $C^*$, but it has to be
derived from the following:

$$CT(C^*) = \min_{S \le L - \mathrm{MMSB}} OVF_{NR}(S)$$
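The dynamic programming algorithm, together with the bookkeeping step needed for trace-back, might be sketched in Python as follows. This is an illustrative sketch and not the report's implementation: the triples (implementation number, integer storage cost, time cost) are hypothetical inputs, the state S runs from 0 rather than 1, an undefined value of the optimal value function is represented by infinity, and `budget` stands for L - MMSB (assumed not to exceed `omsb`).

```python
import math


def dp_tables(choices, omsb):
    """Build ovf[j][s] and pick[j][s] for relations 0..NR-1.

    ovf[j][s] is the least time cost of the first j+1 relations using
    at most s storage units; pick[j][s] records the (mpl, storage)
    selected for relation j in that state, for later trace-back.
    """
    nr = len(choices)
    ovf = [[math.inf] * (omsb + 1) for _ in range(nr)]
    pick = [[None] * (omsb + 1) for _ in range(nr)]
    for j in range(nr):
        for s in range(omsb + 1):
            for mpl, sc, tc in choices[j]:
                if sc > s:
                    continue  # this implementation does not fit in state s
                prev = 0.0 if j == 0 else ovf[j - 1][s - sc]
                if prev + tc < ovf[j][s]:
                    ovf[j][s] = prev + tc
                    pick[j][s] = (mpl, sc)
    return ovf, pick


def optimal_configuration(choices, omsb, budget):
    """Return (time cost, [mpl per relation]) or None if nothing fits."""
    ovf, pick = dp_tables(choices, omsb)
    last = ovf[-1][: budget + 1]
    best = min(last)
    if best == math.inf:
        return None
    s = last.index(best)  # smallest state reaching the best time cost
    config = []
    for j in range(len(choices) - 1, -1, -1):  # trace back from NR to 1
        mpl, sc = pick[j][s]
        config.append(mpl)
        s -= sc
    return best, config[::-1]


choices = [[(1, 10, 5.0), (2, 4, 9.0)], [(1, 8, 3.0), (3, 2, 8.0)]]
print(optimal_configuration(choices, 20, 14))
```

Starting the trace-back from the smallest state that attains the minimum time cost selects, among equally fast configurations, one with the lowest storage cost.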

The algorithm as stated above, however, requires |IMP(j)|
subtractions, additions and comparisons at iteration j for
computing $OVF_j(S)$ for each S. The amount of computation at
iteration j is then proportional to OMSB * |IMP(j)|. Total
computation for all NR iterations (initialization included) is
then proportional to

$$\mathrm{OMSB} \cdot \sum_{j=1}^{NR} |\mathrm{IMP}(j)|.$$

The key difference between the complexities of the dynamic
programming approach and enumeration is that in the former the
complexity is proportional to
$\sigma = \sum_{j=1}^{NR} |\mathrm{IMP}(j)|$ and in the latter it
is proportional to $\pi = \prod_{j=1}^{NR} |\mathrm{IMP}(j)|$,
which is clearly a considerable difference. In our application
example $\pi = 2.32 \times 10^{25}$ and $\sigma = 263$. Although
the dynamic programming algorithm stated above achieves
considerable savings in computation time as compared to
enumeration, it still requires a relatively large number of steps.
This is because at each iteration j, $OVF_j(S)$ must be computed
for each S from 1 to OMSB, and this introduces the factor OMSB in
the expression for the complexity of the algorithm. We recall that
OMSB is the amount of storage available for optimization, given by
OMSB = TMSB - MMSB. This value is on the order of a million
(characters). One remedy for this problem is to reduce the number
of distinct S values considered at each iteration to a given
maximum of, say, 1000 points by scaling all storage costs and
rounding off the results to the nearest integer (see [Mitoma
1975]). That is to say, if OMSB is greater than 1000 characters
(which is almost always the case), then reduce every storage cost,
say S, to the closest integer to (1000/OMSB) * S. This is
equivalent to measuring storage costs in blocks of (OMSB/1000)
characters rather than in blocks of one character, and to
replacing OMSB with 1000 in the algorithm. The rescaling method
has the side effect of introducing round-off errors whose effects
on the outcome of the optimization are not exactly known.
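The rescaling remedy can be illustrated with a few lines of Python. The function name and the sample costs are hypothetical. Note also that Python's built-in `round` resolves exact halves to the nearest even integer, which may differ from the rounding rule intended in the text.

```python
def rescale(storage_cost, omsb, points=1000):
    """Map a storage cost in characters to blocks of OMSB/points characters."""
    if omsb <= points:
        return storage_cost          # nothing to gain from rescaling
    return round(storage_cost * points / omsb)


omsb = 1_000_000                      # on the order of a million characters
for cost in (12_345, 12_600, 400, 499):
    print(cost, "->", rescale(cost, omsb))
```

In this example the costs 400 and 499 both collapse to zero blocks, which shows the kind of round-off error the text warns about.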


In the following two sections we introduce an algorithm that
solves our optimization problem in considerably fewer steps
without the requirement of rescaling storage costs.

4.2  Undominated  Choices

Definition 4.2.1

Let MPL be an acceptable implementation for relation $R_j$, i.e.
AC(MPL, j) = 1. The storage and time costs of implementation MPL
for relation $R_j$ are, then, defined and known. If
S = SC(MPL, j) and T = TC(MPL, j), then we shall call the triple
$\langle \mathrm{MPL}, S, T \rangle$ a choice for the relation
$R_j$. □

Definition 4.2.2

Let $\langle \mathrm{MPL}, S, T \rangle$ and
$\langle \mathrm{MPL}', S', T' \rangle$ be two choices for $R_j$
such that $S \le S'$; then $\langle \mathrm{MPL}, S, T \rangle$ is
said to dominate $\langle \mathrm{MPL}', S', T' \rangle$ if
$T \le T'$. If there are two or more choices for a relation such
that their second and third coordinates are respectively equal,
then all but one are arbitrarily selected as dominated. □
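Definition 4.2.2 translates into a simple predicate. The Python sketch below is an illustration with hypothetical triples (implementation number, storage cost, time cost); it does not apply the arbitrary tie-breaking for choices with identical costs, so two such choices are reported as dominating each other.

```python
def dominates(q, q_prime):
    """True if choice q dominates q_prime: S <= S' and T <= T'."""
    _, s, t = q
    _, s2, t2 = q_prime
    return s <= s2 and t <= t2


def undominated_pair(q, q_prime):
    """True if neither choice dominates the other."""
    return not dominates(q, q_prime) and not dominates(q_prime, q)


print(dominates((1, 10, 5.0), (2, 12, 6.0)))          # cheaper and faster
print(undominated_pair((1, 10, 9.0), (2, 12, 6.0)))   # a genuine trade-off
```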

The motivation for this definition of dominance becomes apparent
from the following observation:

Observation 4.2.1

If $C^*$ is the optimal configuration for SR with a storage limit
of L, i.e. if $C^*$ satisfies (4.1.1) and (4.1.2), then for no
relation implementation pair
$\langle R_j, \mathrm{MPL}_j^* \rangle \in C^*$ does there exist a
choice $\langle \mathrm{MPL}, S, T \rangle$ for $R_j$, with
$\mathrm{MPL} \ne \mathrm{MPL}_j^*$, that dominates
$\langle \mathrm{MPL}_j^*, S^*, T^* \rangle$, where
$S^* = SC(\mathrm{MPL}_j^*, j)$ and $T^* = TC(\mathrm{MPL}_j^*, j)$.

Proof: If $\langle \mathrm{MPL}, S, T \rangle$ dominates
$\langle \mathrm{MPL}_j^*, S^*, T^* \rangle$, then by definition we
must have $S \le S^*$ and $T \le T^*$. Let $C^+$ be the
configuration obtained from $C^*$ by replacing
$\langle R_j, \mathrm{MPL}_j^* \rangle$ with
$\langle R_j, \mathrm{MPL} \rangle$, i.e.
$C^+ = (C^* - \{\langle R_j, \mathrm{MPL}_j^* \rangle\}) \cup \{\langle R_j, \mathrm{MPL} \rangle\}$.
If $T < T^*$ and $S \le S^*$, then $CT(C^+) < CT(C^*)$ and
$CS(C^+) \le CS(C^*)$, which in turn implies that $C^+$ has a lower
time cost than $C^*$; a contradiction to the assumption that $C^*$
was optimal. If $T = T^*$ and $S < S^*$, then $C^+$ and $C^*$ have
equal time costs but $C^+$ has a lower storage cost than $C^*$,
which, by definition, makes $C^+$ optimal, a contradiction again.
Note that if $T = T^*$ and $S = S^*$, then $C^+$ and $C^*$ have
equal storage and time costs; they are both optimal. The argument
still holds because the choice of the optimal configuration is, by
definition, arbitrary. □

It follows from Observation 4.2.1 that no implementation will
ever be selected for any relation if the implementation's number
is the first coordinate of a dominated choice for that relation.
It is, therefore, desirable to exclude those implementations (for
each relation) that are associated with dominated choices from
consideration as potential implementation alternatives for the
relation.

Definition 4.2.3

If there are two choices $\langle \mathrm{MPL}, S, T \rangle$ and
$\langle \mathrm{MPL}', S', T' \rangle$ for relation $R_j$ such
that neither one dominates the other, then the two choices are
said to be a pair of undominated choices for $R_j$.

Observation 4.2.2

Let $q = \langle \mathrm{MPL}, S, T \rangle$ and
$q' = \langle \mathrm{MPL}', S', T' \rangle$ be a pair of
undominated choices for relation R. Then the following are true:

i)    $S \ne S'$ and $T \ne T'$

ii)   $S < S'$ iff $T > T'$

iii)  $S > S'$ iff $T < T'$

Proof: q and q' form a pair of undominated choices for relation
R iff the following is true:

$$\neg (S \le S' \land T \le T') \;\land\; \neg (S' \le S \land T' \le T)$$

where $\land$, $\lor$ and $\neg$ are the usual logical AND, OR and
NOT operations, respectively, with the conventional precedence
relations. The above logical expression is equivalent to:

$$((S' < S) \land (T < T')) \;\lor\; ((S < S') \land (T' < T))$$

From this expression, the results i), ii) and iii) follow. □

Definition 4.2.4

A set of choices
$\{\langle \mathrm{MPL}_1, S_1, T_1 \rangle, \ldots, \langle \mathrm{MPL}_r, S_r, T_r \rangle\}$
is said to be an undominated set of choices for a data base
relation R if every pair of elements of the set is a pair of
undominated choices for R. □

Let Q be a set of undominated choices for relation R, where
$Q = \{\langle \mathrm{MPL}_i, S_i, T_i \rangle \mid i = 1, \ldots, p\}$.
Since by definition for all $i \ne j \in \{1, \ldots, p\}$,
$\langle \mathrm{MPL}_i, S_i, T_i \rangle$ and
$\langle \mathrm{MPL}_j, S_j, T_j \rangle$ form a pair of
undominated choices, we know from part i) of Observation 4.2.2
that for $i \ne j \in \{1, 2, \ldots, p\}$, $S_i \ne S_j$ and
$T_i \ne T_j$. We shall, therefore, assume, with no loss of
generality, that $S_1 < S_2 < \cdots < S_p$, or equivalently, for
i = 1, 2, ..., p - 1, $S_i < S_{i+1}$. From part ii) of
Observation 4.2.2, it follows that for i = 1, ..., p - 1,
$T_i > T_{i+1}$.

Theorem 4.2.1

Let $Q = \{q_i \mid q_i = \langle \mathrm{MPL}_i, S_i, T_i \rangle,\ i = 1, \ldots, p\}$
be a set of undominated choices for relation R with the property
that $S_i < S_{i+1}$ (and thereby $T_i > T_{i+1}$) for all
i = 1, 2, ..., p - 1. Let
$q = \langle \mathrm{MPL}, S, T \rangle$,
$\mathrm{MPL} \ne \mathrm{MPL}_i$, i = 1, 2, ..., p, be another
choice for R such that $S > S_p$. Then q is pairwise undominated
with $q_i$, i = 1, 2, ..., p, iff q and $q_p$ is a pair of
undominated choices for R.

Proof: The "only if" part of the proof is trivial. In order to
prove the "if" part, we first observe that q cannot dominate any
$q_i \in Q$. This is because $S > S_p$, which implies $S > S_i$,
i = 1, ..., p (see Definition 4.2.2). We now have to show that no
$q_i \in Q$ dominates q to complete the proof. Suppose
$q_j \in Q$ dominates q, for some $j \in \{1, 2, \ldots, p - 1\}$.
Note that $j \ne p$ because q and $q_p$ form a pair of undominated
choices, by assumption. Since j < p, we know that $S_j < S_p$,
which in conjunction with $S_p < S$ implies $S > S_j$. $S > S_j$
in conjunction with the assumption that $q_j$ dominates q implies
$T_j \le T$ (Definition 4.2.2). But again since j < p, we know
that $T_j > T_p$, and hence, $T > T_p$. Referring back to
Definition 4.2.2, it is clear that $q_p$ must dominate q (because
we arrived at $T_p < T$ and $S_p < S$), a contradiction. □

A useful  corollary  of  the  above  theorem  leads  us  to 
an  efficient  algorithm  for  generating  an  ordered  set  of 
undominated  choices  for  each  relation  in  SR. 


Corollary 4.2.1

Let $Q = \{q_i \mid q_i = \langle \mathrm{MPL}_i, S_i, T_i \rangle,\ i = 1, 2, \ldots, p - 1\}$
be a set of undominated choices for relation R with the property
that $S_i < S_{i+1}$ (and, therefore, $T_i > T_{i+1}$) for all
i = 1, 2, ..., p - 2. Let
$q_p = \langle \mathrm{MPL}_p, S_p, T_p \rangle$ be another choice
for R such that $S_p \ge S_{p-1}$. Let $Q^+$ be an undominated set
of choices for R involving the maximum number of elements of
$Q \cup \{q_p\}$, such that for all $q \in Q \cup \{q_p\} - Q^+$,
q is dominated by some $q' \in Q^+$. Then:

i)    $Q^+ = Q \cup \{q_p\}$ iff $S_p > S_{p-1}$ and
      $T_p < T_{p-1}$

ii)   $Q^+ = Q \cup \{q_p\} - \{q_{p-1}\}$ if $S_p = S_{p-1}$ and
      $T_p < T_{p-1}$

iii)  [illegible in source]

iv)   $Q^+ = Q$ if either of the following holds:

      $$S_p > S_{p-1} \;\land\; T_p \ge T_{p-1} \qquad (4.2.1)$$

      $$S_p = S_{p-1} \;\land\; T_p > T_{p-1} \qquad (4.2.2)$$

Proof: i): The "if" part is a direct consequence of Theorem 4.2.1
because ($S_p > S_{p-1}$ and $T_p < T_{p-1}$) implies that $q_p$
and $q_{p-1}$ is an undominated pair. Maximality of $Q^+$ is clear
because in this case $Q^+ = Q \cup \{q_p\}$. For the "only if"
part, we observe that since $Q^+$ is undominated, $q_{p-1}$ and
$q_p$ is an undominated pair, and since $S_p \ge S_{p-1}$ is
assumed, it follows from Observation 4.2.2 that $S_p > S_{p-1}$
and $T_p < T_{p-1}$.

ii): Suppose $S_p = S_{p-1}$ and $T_p < T_{p-1}$. These two
relations in conjunction with $S_{p-2} < S_{p-1}$ and
$T_{p-1} < T_{p-2}$ (known from the definition of Q) result in:

$$S_{p-2} < S_p \quad\text{and}\quad T_{p-2} > T_p.$$

It follows from the latter two relations that $q_p$ and $q_{p-2}$
is an undominated pair of choices for R. Then by Theorem 4.2.1,
$Q \cup \{q_p\} - \{q_{p-1}\}$ is an undominated set of choices for
R. Furthermore, $q_{p-1}$ is the only member of the set
$Q \cup \{q_p\} - Q^+$ and it is dominated by $q_p \in Q^+$. The
set of choices $Q \cup \{q_p\}$ is not undominated. At least one
of its elements must be deleted so that the remaining elements can
form an undominated set. Since $Q^+$ is obtained from
$Q \cup \{q_p\}$ by deleting $q_{p-1}$, maximality of $Q^+$ is
established.

iii): The proof of this part is trivial.

iv): Since $S_p \ge S_{p-1}$ is assumed, the only remaining cases
are when either (4.2.1) or (4.2.2) is true, in both of which cases
$q_{p-1}$ dominates $q_p$. Again maximality of $Q^+$ is clear
because $q_p$ is the only member of the set
$Q \cup \{q_p\} - Q^+$, it is dominated by $q_{p-1} \in Q^+$, and
at least one element of $Q \cup \{q_p\}$ must be discarded so that
the remaining elements can form an undominated set of choices for
R. □

The following algorithm follows immediately from the above corollary. The purpose of the algorithm is to generate (logically) a set of undominated choices ordered in the increasing sequence of storage costs, for each relation $R_j \in SR$.

Algorithm 4.2.1

    For j = 1 until NR do;

       1. Sort all p' = |IMP(j)| choices for relation R_j in
          nondecreasing sequence of storage costs (2nd coordinate).
          In case of equal storage costs, sort in nondecreasing
          sequence of time costs (3rd coordinate). Let
          Q' = {q_1, q_2, ..., q_p'} be the ordered set of choices
          for R_j so obtained.

       2. Initialize the (ordered) set Q_j of maximal undominated
          choices for R_j to the empty set.
          Initialize Best Time Cost BTC to infinity.

       3. For p = 1 until p' do;
             Let T_p be the time cost (3rd coordinate) of q_p in Q';
             If T_p < BTC then do;
                Q_j = Q_j U {q_p};  /* make q_p the last element of the new Q_j */
                BTC = T_p;
             end;
          end;

    end;
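The filtering loop of Algorithm 4.2.1 can be sketched in Python for a single relation. This is only an illustration under stated assumptions: the function name `undominated_choices` and the tuple layout `(MPL, storage cost, time cost)` are hypothetical and do not appear in the report.

```python
from typing import List, Tuple

# Hypothetical layout of a choice: (implementation number, storage cost, time cost).
Choice = Tuple[int, float, float]


def undominated_choices(choices: List[Choice]) -> List[Choice]:
    """Return the maximal undominated subset, ordered by increasing storage cost."""
    # Step 1: sort by storage cost, breaking ties by time cost.
    ordered = sorted(choices, key=lambda q: (q[1], q[2]))

    # Step 2: start with an empty result and an infinite best time cost (BTC).
    kept: List[Choice] = []
    best_time = float("inf")

    # Step 3: a choice is retained only if it strictly improves the best time so far.
    for q in ordered:
        if q[2] < best_time:
            kept.append(q)
            best_time = q[2]
    return kept


# Five made-up choices for one relation.
sample = [(1, 10, 50), (2, 10, 40), (3, 12, 45), (4, 15, 30), (5, 20, 30)]
result = undominated_choices(sample)
```

In this made-up sample, choices 1, 3 and 5 are dropped because an equally cheap or cheaper choice is at least as fast, so `result` holds choices 2 and 4 in increasing order of storage cost and decreasing order of time cost.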

The sorting of p' = |IMP(j)| elements in step 1 has a time complexity proportional to $p'^2$. The complexity of the loop in step 3 is proportional to p'. The algorithm then has a total time complexity proportional to NR * f(p'') where f(p'') is a polynomial of degree two in p'' = Avg |IMP(j)| over j from 1 to NR.

Algorithm 4.2.1 constitutes the first phase of our optimization process. We recall from Section 4.1 that a necessary condition for the existence of an optimal configuration is the following: $|IMP(j)| \ge 1$ for all $j \in \{1, 2, \ldots, NR\}$. This condition guarantees that there exists at least one choice for every relation $R_j \in SR$.

In general, there is more than one acceptable implementation (and thereby choice) for a relation. However, it may be the case that for some relation $R_j$, there exists a choice which dominates all other choices. In such a case the maximal set of undominated choices for relation $R_j$ contains only one element, say $q_j = \langle MPL_j, S_j, T_j \rangle$. It is clear that the optimal choice of an implementation for $R_j$ in this case is $MPL_j$ and that relation $R_j$ may be excluded from consideration by the second phase of optimization. The storage and time costs of all such relations will be accumulated and denoted by BASC and BATC, respectively.

Thus, after the first phase of optimization is completed, we have a number $MR \le NR$ of relations that constitute MR stages of a decision process to be solved in the second phase of optimization. Each stage k, k = 1, ..., MR, has the following data:

    RL(k)        is the index $j \in \{1, 2, \ldots, NR\}$ of $R_j \in SR$
                 which is the relation associated with stage k.

    $Q_{RL(k)} \ne \emptyset$
                 is the maximal set of undominated choices for stage k,
                 ordered in the increasing sequence of storage costs.

    NCHS(k)      is the number of elements of $Q_{RL(k)}$, i.e. the
                 maximal number of undominated choices for stage k.

    FMP(i, k)    is the implementation number associated with (the
                 first coordinate of) the i-th choice in $Q_{RL(k)}$
                 for i = 1, 2, ..., NCHS(k).


Clearly, the information content of $Q_{RL(k)}$, k = 1, ..., MR can be derived from FMP(i, k), i = 1, 2, ..., NCHS(k), k = 1, 2, ..., MR and RL(k) using the SC(MPL, j) and TC(MPL, j) tables derived earlier.

For example the i-th choice of the k-th stage is:

$q_i^k = \langle FMP(i, k),\ SC(FMP(i, k), RL(k)),\ TC(FMP(i, k), RL(k)) \rangle$

or equivalently $q_i^k = \langle MPL_i^k, S_i^k, T_i^k \rangle$

where $MPL_i^k = FMP(i, k)$
      $S_i^k = SC(MPL_i^k, j)$
      $T_i^k = TC(MPL_i^k, j)$
      $j = RL(k)$


The matrix FMP is generated by phase 1 and used as input to the second phase of optimization. We observe that since all relations for which $|Q_j| = 1$ are discarded by the second phase, NCHS(k) is clearly greater than or equal to 2. BASC and BATC are given by the following:

$BASC = \sum_{j=1}^{NR} SC(MPL_j, j) \cdot d_j$

$BATC = \sum_{j=1}^{NR} TC(MPL_j, j) \cdot d_j$

where $d_j = 1$ if $|Q_j| = 1$ and $d_j = 0$ otherwise, and $MPL_j$ is the first coordinate of the first choice in $Q_j$.
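As an illustration of how phase 1 hands its results to phase 2, the following Python sketch accumulates BASC and BATC over the relations with a single undominated choice and builds RL and FMP for the rest. The function name `split_fixed_and_variable` and the dictionary input are assumptions made for this example, and the lists are 0-based whereas the text numbers stages and choices from 1.

```python
from typing import Dict, List, Tuple

# Hypothetical layout of a choice: (implementation number, storage cost, time cost).
Choice = Tuple[int, float, float]


def split_fixed_and_variable(q: Dict[int, List[Choice]]):
    """Split relations into fixed ones (|Q_j| = 1) and stages for phase 2.

    q maps a relation index j to its ordered set of undominated choices Q_j.
    Returns (BASC, BATC, RL, FMP).
    """
    basc = 0.0                   # accumulated storage cost of fixed relations
    batc = 0.0                   # accumulated time cost of fixed relations
    rl: List[int] = []           # rl[k]: relation index associated with stage k
    fmp: List[List[int]] = []    # fmp[k][i]: implementation number of choice i of stage k

    for j in sorted(q):
        choices = q[j]
        if len(choices) == 1:    # d_j = 1: the implementation is fixed
            _, storage, time = choices[0]
            basc += storage
            batc += time
        else:                    # d_j = 0: the relation becomes a stage
            rl.append(j)
            fmp.append([mpl for mpl, _, _ in choices])
    return basc, batc, rl, fmp


# Relations 1 and 3 each have one undominated choice; relation 2 has two.
example = {1: [(3, 5, 7)], 2: [(1, 2, 9), (4, 6, 3)], 3: [(2, 1, 1)]}
basc, batc, rl, fmp = split_fixed_and_variable(example)
```

With this made-up input only relation 2 survives as a stage, so MR = 1 and NCHS(1) = 2.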

We can now identify two special configurations $C_m$ and $C_M$ which have the smallest and largest storage costs, respectively.

$C_m = \{\langle R_j, MPL_j \rangle \mid j = 1, 2, \ldots, NR,\ R_j \in SR,\ MPL_j$ = first coordinate of the first element of $Q_j\}$

$C_M = \{\langle R_j, MPL_j \rangle \mid j = 1, 2, \ldots, NR,\ R_j \in SR,\ MPL_j$ = first coordinate of the last element of $Q_j\}$

$C_M$ and $C_m$ are also the configurations with the smallest and largest time costs, respectively. Increasing the storage limit L above $MMSB + CS(C_M)$ does not change the optimal solution, $C_M$. Also if $L < MMSB + CS(C_m)$, there will be no optimal solutions. Note that $L_0 = MMSB + CS(C_m)$ where $L_0$ was discussed in Section 4.1. The storage and time costs of $C_M$ and $C_m$ can be computed from the following:

$CS(C_m) = BASC + \sum_{k=1}^{MR} SC(FMP(1, k), RL(k))$

$CT(C_m) = BATC + \sum_{k=1}^{MR} TC(FMP(1, k), RL(k))$

$CS(C_M) = BASC + \sum_{k=1}^{MR} SC(FMP(NCHS(k), k), RL(k))$

$CT(C_M) = BATC + \sum_{k=1}^{MR} TC(FMP(NCHS(k), k), RL(k))$


Let FSR be the subset of the set of data base relations SR consisting of all relations for which $|IMP(j)| = 1$, i.e. $FSR = \{R_j \mid R_j \in SR$ and $|IMP(j)| = 1\}$. The optimal partial configuration for FSR is denoted by FPC and it is given by

$FPC = \{\langle R_j, MPL_j \rangle \mid R_j \in FSR$ and $MPL_j$ is the first coordinate of the first element of $Q_j\}$.

Clearly BASC and BATC are storage and time costs of FPC, respectively. The set FSR is the set of relations in SR for which the optimal implementation is fixed. The complement of this set with respect to SR, denoted VSR, is given by:

$VSR = SR - FSR = \{R_j \mid j = RL(k),\ k = 1, 2, \ldots, MR\}$.

For each storage limit value L, the optimal configuration C for SR is the union of two sets: FPC, the optimal partial configuration for FSR, and $VPC^*$, the optimal partial configuration for VSR, where $VPC^*$ is defined as follows:

$CS(VPC^*) \le L - BASC - MMSB$    (4.2.3)

$CT(VPC^*) \le CT(VPC)$ for all partial configurations VPC for VSR such that $CS(VPC) \le L - BASC - MMSB$    (4.2.4)

The task of the second phase of optimization is to find $VPC^*$, taking advantage of the fact (see Observation 4.2.1) that for every relation $R_{RL(k)}$, k = 1, 2, ..., MR, only NCHS(k) implementations are candidates for the optimal choice, namely those associated with the undominated choices in $Q_{RL(k)} = Q^k$.


4.3  Undominated  Solutions 

In this section we shall present the algorithm used in the second phase of the optimization. This algorithm takes the results of the first phase as input and develops, as output, a series of optimal configurations, one for each storage limit range. Before discussing the algorithm, however, we shall define a few terms.

Definition 4.3.1

Let $SSR_k$ be a subset of VSR defined as follows:

$SSR_k = \{R_{RL(p)} \mid R_{RL(p)} \in VSR$ and $p = 1, 2, \ldots, k\}$,  k = 1, 2, ..., MR

Let a partial configuration $PC^k$ for $SSR_k$ be defined as follows:

$PC^k = \{\langle R_{RL(p)}, MPL_p \rangle \mid R_{RL(p)} \in SSR_k,\ MPL_p = FMP(i_p, p)$ for some $1 \le i_p \le NCHS(p),\ p = 1, 2, \ldots, k\}$,  k = 1, 2, ..., MR

If $S^k$, $T^k$ are storage and time costs of $PC^k$, respectively, then we shall call the triple $\langle PC^k, S^k, T^k \rangle$ a k-stage solution (or a solution to stage k). □

Note that in Definition 4.3.1, $SSR_k$ is a set composed of the relations associated with the first k stages of the decision process and $PC^k$ is a partial configuration for that set of relations composed of relation-implementations associated with undominated choices. It is clear that $SSR_{MR} = VSR$.

Definition 4.3.2

Let $z = \langle PC, S, T \rangle$ and $z' = \langle PC', S', T' \rangle$ be two k-stage solutions (for $SSR_k$) as defined in Definition 4.3.1, such that $S \le S'$; then z is said to dominate z' if $T \le T'$. If there are two or more k-stage solutions (for $SSR_k$) such that their 2nd and 3rd coordinates are respectively equal, then all but one are arbitrarily selected as dominated. □

This definition is similar to Definition 4.2.2 of dominance between two choices for a relation. The motivation for this definition is stated in the following observation:

Observation 4.3.1

Let $VPC^*$ be the optimal partial configuration for VSR, i.e. $VPC^*$ satisfies (4.2.3) and (4.2.4). Let $PC^*$ be a subset of $VPC^*$ composed of the first k elements of $VPC^*$. Then for no k = 1, 2, ..., MR, there exists a k-stage solution $z = \langle PC, S, T \rangle$ that dominates $z^* = \langle PC^*, S^*, T^* \rangle$ and $PC \ne PC^*$, where $S^* = CS(PC^*)$ and $T^* = CT(PC^*)$.

Proof: Suppose z dominates $z^*$; then by definition $CS(PC) \le CS(PC^*)$ and $CT(PC) \le CT(PC^*)$. Let $VPC^+$ be the partial configuration obtained as follows:

$VPC^+ = (VPC^* - PC^*) \cup PC$

With an argument similar to the one presented in Observation 4.2.1, it can be shown that $VPC^+$ must have a smaller time cost (or smaller storage cost in case of equal time costs) than $VPC^*$. This is a contradiction to the assumption that $VPC^*$ was optimal for VSR. □

This observation justifies our desire to only consider solutions that are not dominated by other solutions. In fact, it can be easily shown that, whatever algorithm is used to derive (k + 1)-stage solutions from k-stage solutions (k = 1, 2, ..., MR - 1), dominated k-stage solutions "lead to" dominated (k + 1)-stage solutions.

Definition  4.3.3 

If there are two k-stage solutions z and z' such that neither one dominates the other, then z' and z are said to be a pair of undominated k-stage solutions. □

The property shown for undominated choices of a relation in Observation 4.2.2 applies to undominated solutions and it is stated in the following:

Observation 4.3.2

If $z = \langle PC, S, T \rangle$ and $z' = \langle PC', S', T' \rangle$ are two undominated k-stage solutions where $PC \ne PC'$, then either $S < S'$ and $T > T'$ or $S' < S$ and $T' > T$. □

Definition  4.3.4 

A set  of  k-stage  solutions  is  said  to  be  an  undominated 
set  of  k-stage  solutions  if  every  pair  of  elements  in  the 
set  is  a pair  of  undominated  k-stage  solutions.  □ 

As in the case of a set of undominated choices for a relation, the elements of a set of k-stage undominated solutions $Z^k = \{z_i^k \mid z_i^k = \langle PC_i^k, S_i^k, T_i^k \rangle\}$ may be assumed, with no loss of generality, to be in the increasing order of storage costs and (at the same time) in the decreasing order of time costs. In other words, if $p = |Z^k|$ and $z_i^k = \langle PC_i^k, S_i^k, T_i^k \rangle \in Z^k$, then for all $i = 1, 2, \ldots, p - 1$:

$S_i^k < S_{i+1}^k$ and $T_i^k > T_{i+1}^k$.

Theorem 4.3.1

Let $Z^k = \{z_i^k \mid z_i^k = \langle PC_i^k, S_i^k, T_i^k \rangle,\ i = 1, 2, \ldots, p\}$ be a set of undominated k-stage solutions such that $S_i^k < S_{i+1}^k$ for $i = 1, 2, \ldots, p - 1$. Let $z^k = \langle PC^k, S^k, T^k \rangle$ be another k-stage solution ($PC^k \ne PC_i^k$, $i = 1, 2, \ldots, p$) such that $S^k \ge S_p^k$. Then $z^k$ is pairwise undominated with $z_i^k$, $i = 1, 2, \ldots, p$, iff $z^k$ and $z_p^k$ is a pair of undominated k-stage solutions. □

The proof of this theorem is analogous to the proof given for a set of undominated choices in Theorem 4.2.1.

The following corollary follows from Theorem 4.3.1:

Corollary 4.3.1

Let $Z^k = \{z_i^k \mid z_i^k = \langle PC_i^k, S_i^k, T_i^k \rangle,\ i = 1, 2, \ldots, p - 1\}$ be a set of undominated k-stage solutions with the property that $S_i^k < S_{i+1}^k$ (and, therefore, $T_i^k > T_{i+1}^k$) for all $i = 1, 2, \ldots, p - 2$. Let $z_p^k = \langle PC_p^k, S_p^k, T_p^k \rangle$ be another k-stage solution such that $S_p^k \ge S_{p-1}^k$. Let $Z^{k+}$ be an undominated set of k-stage solutions involving the maximum number of elements of $Z^k \cup \{z_p^k\}$, such that for all $z \in Z^k \cup \{z_p^k\} - Z^{k+}$, z is dominated by some $z' \in Z^{k+}$. Then:

i) $Z^{k+} = Z^k \cup \{z_p^k\}$ iff $S_p^k > S_{p-1}^k$ and $T_p^k < T_{p-1}^k$.

ii) $Z^{k+} = Z^k \cup \{z_p^k\} - \{z_{p-1}^k\}$ if $S_p^k = S_{p-1}^k$ and $T_p^k < T_{p-1}^k$.

iii) $Z^{k+} = Z^k$ or $Z^{k+} = Z^k \cup \{z_p^k\} - \{z_{p-1}^k\}$ if $S_p^k = S_{p-1}^k$ and $T_p^k = T_{p-1}^k$.

iv) $Z^{k+} = Z^k$, otherwise. □

As discussed earlier, we are interested in generating k-stage solutions for k = 1, 2, ..., MR, but retaining only the maximal set of undominated solutions at each stage. If k-stage solutions are generated in the proper order, namely in increasing order of storage costs and increasing order of time costs in case of equal storage costs, then it is clear from the above corollary that only one comparison is necessary to decide whether to retain or discard a generated solution.

Let $Z^k$ denote the set of k-stage undominated solutions,

$Z^k = \{z_i^k \mid z_i^k = \langle PC_i^k, SS_i^k, TT_i^k \rangle,\ i = 1, 2, \ldots, NS(k)\}$ where $NS(k) = |Z^k|$,

and let $Q^k$ be the set of undominated choices for stage k (relation RL(k)),

$Q^k = \{q_p^k \mid q_p^k = \langle MPL_p^k, S_p^k, T_p^k \rangle,\ p = 1, 2, \ldots, NCHS(k)\}$

for k = 1, 2, ..., MR. In order to derive $Z^k$ from $Q^k$, k = 1, 2, ..., MR, we make use of the following two observations.

Observation 4.3.3

Let $Z^k$ and $Q^k$ be defined as above. The set $Z^1$ of undominated solutions to stage 1 is directly derivable from $Q^1 = \{q_p^1 \mid q_p^1 = \langle MPL_p^1, S_p^1, T_p^1 \rangle,\ p = 1, 2, \ldots, NCHS(1)\}$ in the following way:

NS(1) = NCHS(1)

For i = 1, ..., NS(1), $z_i^1 = \langle PC_i^1, S_i^1, T_i^1 \rangle$ where $PC_i^1 = \{\langle R_{RL(1)}, MPL_i^1 \rangle\}$.

Proof: For each $1 \le i \le NS(1)$, $z_i^1$ satisfies the requirements of Definition 4.3.1 for k = 1; therefore, it is a solution to stage 1. The claim that $Z^1$, composed of all such $z_i^1$, is the maximal set of undominated solutions to stage 1 follows directly from the assumption that $Q^1$ is the maximal set of undominated choices for stage 1 and that $SS_i^1 = S_i^1$ and $TT_i^1 = T_i^1$, i = 1, 2, ..., NCHS(1). □

Observation 4.3.4

Let $Z^k$ and $Q^k$ be defined as above for k = 1, 2, ..., MR. $z = \langle PC, SS, TT \rangle$ is a (k + 1)-stage solution if for some $z_i = \langle PC_i, SS_i, TT_i \rangle$ in $Z^k$ and some $q_p = \langle MPL_p, S_p, T_p \rangle$ in $Q^{k+1}$ the following is true:

$PC = PC_i \cup \{\langle R_{RL(k+1)}, MPL_p \rangle\}$

$SS = SS_i + S_p$

$TT = TT_i + T_p$,

$1 \le i \le NS(k)$, $1 \le p \le NCHS(k + 1)$

The proof of this observation is also clear from Definition 4.3.1. □
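Observation 4.3.4 can be illustrated with a deliberately naive Python sketch that forms every combination of a k-stage solution with a (k + 1)-stage choice, sorts the candidates, and keeps the undominated ones. The names `extend_stage`, `Solution` and `Choice` are hypothetical, and this brute-force approach is not the merge procedure of Algorithm 4.3.1; it only shows what set that algorithm is expected to produce.

```python
from typing import List, Tuple

# Hypothetical layouts:
#   a choice   is (implementation number, storage cost, time cost)
#   a solution is (tuple of implementation numbers per stage, SS, TT)
Choice = Tuple[int, float, float]
Solution = Tuple[Tuple[int, ...], float, float]


def extend_stage(z_k: List[Solution], q_next: List[Choice]) -> List[Solution]:
    """Combine every k-stage solution with every (k+1)-stage choice, then filter."""
    # PC = PC_i U {MPL_p}, SS = SS_i + S_p, TT = TT_i + T_p
    candidates = [
        (pc + (mpl,), ss + s, tt + t)
        for pc, ss, tt in z_k
        for mpl, s, t in q_next
    ]
    # Proper order: increasing storage cost, ties broken by increasing time cost.
    candidates.sort(key=lambda z: (z[1], z[2]))

    kept: List[Solution] = []
    best_time = float("inf")
    for z in candidates:
        if z[2] < best_time:   # one comparison decides retain or discard
            kept.append(z)
            best_time = z[2]
    return kept


z1 = [((1,), 2, 10), ((2,), 5, 4)]   # two undominated 1-stage solutions
q2 = [(7, 1, 6), (8, 4, 5)]          # two undominated choices for stage 2
z2 = extend_stage(z1, q2)
```

Of the four candidates in this made-up example, the one built from solution 1 and choice 8 has storage 6 and time 15 and is dominated by the candidate with storage 6 and time 10, so three 2-stage solutions remain.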

The (k + 1)-stage solutions obtained in this manner are not, however, necessarily undominated. For the purpose of simplicity, let N = NS(k) and C = NCHS(k + 1); then there are N*C solutions at stage (k + 1). We have already established that if these solutions are considered in the proper order, then it can be decided whether or not a solution is dominated in only one step. The task of selection of (k + 1)-stage solutions in the desired order takes (N*C(N*C - 1))/2 steps. However, since elements of $Z^k$ are already in the desired order, all (k + 1)-stage solutions generated by the addition of a given undominated choice for stage (k + 1) to elements of $Z^k$ will be in the desired order. Taking advantage of this property, we can reduce the number of steps to N*C(C - 1). This is a considerable saving in time complexity as long as $N > 2 - 1/C$, which is almost always the case (recall that $C \ge 2$).

The process of selection of (k + 1)-stage solutions in the desired order using the above method is similar to that of merging, at stage k, NCHS(k + 1) runs of length NS(k). Of course, instead of retaining the whole merged sequence, only those items associated with undominated solutions will be retained. In that sense, more elaborate techniques developed for "multiway merging" may be used when necessary. For example, if NCHS(k) is usually larger than 8, some work can be saved by using a "selection tree" (see [Knuth 1973]).

Algorithm 4.3.1 below generates all k-stage solutions for k = 1, 2, ..., MR. Following is a description of some important variables used in the algorithm:

    MNIT          is the maximum of NCHS(k) over k = 1, 2, ..., MR.

    ITP(i)        i = 1, 2, ..., MNIT, called the Input Tape Pointer,
                  is an array of indices where for
                  1 <= CH <= NCHS(k + 1), ITP(CH) contains, at stage k,
                  the k-stage solution number to be considered with
                  choice number CH of stage k + 1.

    CH            is the choice number, index into FMP(CH, k + 1),
                  1 <= CH <= NCHS(k + 1).

    NSC, NTC      are the storage and time costs, respectively, of the
                  (k + 1)-stage solution composed of k-stage solution
                  number ITP(CH) with the (k + 1)-stage choice CH.

    WINCH, WINSC, are choice number, storage and time cost,
    WINTC         respectively, of the "winner" (k + 1)-stage solution.

    NS(k)         k = 1, 2, ..., MR, is the number of undominated
                  solutions to stage k.

    OPL(i, k)     k = 1, 2, ..., MR, i = 1, 2, ..., NS(k), is the
                  implementation number (1st coordinate) of the
                  optimal choice for stage k (optimal policy)
                  associated with the i-th undominated solution of
                  stage k.

    POP(i, k)     k = 1, 2, ..., MR, i = 1, 2, ..., NS(k), is called
                  the Previous Optimal Policy and contains the index
                  i' of the (k - 1)-stage solution that when combined
                  with implementation OPL(i, k) for R_RL(k) produces
                  the i-th solution to stage k.

    NSS(k')       k' = 1, 2, number of undominated k and (k + 1)-stage
                  solutions, respectively.

    OVF(i, k')    k' = 1, 2, i = 1, 2, ..., NSS(k'), called the
                  optimal (objective) value function; contains the
                  time cost of the i-th undominated solution to
                  stage k.

    OVS(i, k')    defined similar to OVF(i, k') for storage cost of
                  undominated k-stage solutions.

Note that the two matrices OVF and OVS have only 2 columns (k' = 1, 2), rather than MR columns. This is because the storage and time costs of undominated solutions of stage k are only needed to produce those of stage k + 1 and, therefore, only two sets of pairs of storage and time costs should be maintained at each stage. Two variables INV and OUV contain column indices into the matrix for k- and (k + 1)-stage solutions, respectively, where INV, OUV ∈ {1, 2}. The optimal policy at each stage (OPL matrix) should, however, be maintained for all MR stages. The POP matrix, while not absolutely necessary to maintain, facilitates to a great extent the process of backtracking overall optimal policies (Algorithm 4.3.2).

Algorithm 4.3.1

    /* Generate one-stage solutions, using Observation 4.3.3 */
    NSS(1) = NCHS(1);
    INV = 1; OUV = 2;
    For CH = 1 until NSS(1) do;
       OVF(CH, INV) = TC(FMP(CH, 1), RL(1));
       OVS(CH, INV) = SC(FMP(CH, 1), RL(1));
       OPL(CH, INV) = FMP(CH, 1);
    end;

    /* Now, at stage k, generate solutions for stage */
    /* k + 1, using Observation 4.3.4                */
    For k = 1 until MR - 1 do;
       OVS(NSS(INV) + 1, INV) = INFINITY;  /* End Marker = Infinity */
       NSS(OUV) = 0;   /* No (k + 1)-stage solution yet */
       ITP = 1;        /* Set all Input Tape Pointers to 1 */
       WINNER = 1;     /* WINNER = 1 if there exists a winner choice,
                          otherwise WINNER = 0 */
       While WINNER = 1 do;
          WINNER = 0;                /* No Winner yet */
          WINTC, WINSC = INFINITY;   /* Initialize winner's storage
                                        and time costs */
          For CH = 1 until NCHS(k + 1) do;
             NSC = OVS(ITP(CH), INV) + SC(FMP(CH, k + 1), RL(k + 1));
             /* This is the Next Storage Cost to be compared */
             If NSC < MSB then do;
                WINNER = 1;   /* Indicate existence of a winner */
                NTC = OVF(ITP(CH), INV) + TC(FMP(CH, k + 1), RL(k + 1));
                If NSC = WINSC then do;
                   If NTC < WINTC then Interchange;
                end;
                else If NSC < WINSC then Interchange;
             end;
          end;
          If WINNER = 1 then do;
             ITP(WINCH) = ITP(WINCH) + 1;   /* Advance the proper tape */
             If NSS(OUV) = 0 then Add;      /* Add 1st one */
             else If OVF(NSS(OUV), OUV) > WINTC then Add;
                                            /* Add others if undominated */
          end;
       end;
       TEMP = INV; INV = OUV; OUV = TEMP;
    end;
    □


In this algorithm Add and Interchange are as follows:

    Interchange: do;
       WINCH = CH;
       WINTC = NTC;
       WINSC = NSC;
    end;

    Add: do;
       NSS(OUV) = NSS(OUV) + 1;
       OVF(NSS(OUV), OUV) = WINTC;
       OVS(NSS(OUV), OUV) = WINSC;
       OPL(NSS(OUV), k + 1) = FMP(WINCH, k + 1);
       POP(NSS(OUV), k + 1) = ITP(WINCH) - 1;
    end;
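The forward pass of Algorithm 4.3.1 can be sketched in Python as a multiway merge with one tape pointer per choice. This is a simplified reading rather than a transcription: the function name `forward_pass` and the list-based tables are assumptions, indices are 0-based, exhausted tapes are detected by length instead of an INFINITY end marker, and the storage bound test against MSB is omitted.

```python
from typing import List, Tuple

# Hypothetical layout of a choice: (implementation number, storage cost, time cost).
Choice = Tuple[int, float, float]


def forward_pass(stages: List[List[Choice]]):
    """stages[k] holds the undominated choices of stage k + 1, sorted by
    increasing storage cost (and hence decreasing time cost).

    Returns (ovs, ovf, opl, pop) for the last stage: storage costs, time
    costs, and the per-stage policy tables used for backtracking.
    """
    # One-stage solutions are the choices of stage 1 themselves.
    ovs = [s for _, s, _ in stages[0]]
    ovf = [t for _, _, t in stages[0]]
    opl = [[mpl for mpl, _, _ in stages[0]]]
    pop = [[-1] * len(stages[0])]      # stage 1 has no predecessor

    for q in stages[1:]:
        itp = [0] * len(q)             # input tape pointer for each choice
        new_s, new_t, new_opl, new_pop = [], [], [], []
        while True:
            win = None                 # (storage, time, choice index) of the winner
            for ch, (_, s, t) in enumerate(q):
                if itp[ch] == len(ovs):        # this tape is exhausted
                    continue
                cand = (ovs[itp[ch]] + s, ovf[itp[ch]] + t, ch)
                # Smaller storage wins; equal storage is decided by time.
                if win is None or cand[:2] < win[:2]:
                    win = cand
            if win is None:            # every tape is exhausted
                break
            nsc, ntc, ch = win
            # Add the winner only if it beats the time of the last kept solution.
            if not new_t or ntc < new_t[-1]:
                new_s.append(nsc)
                new_t.append(ntc)
                new_opl.append(q[ch][0])
                new_pop.append(itp[ch])
            itp[ch] += 1               # advance the proper tape
        ovs, ovf = new_s, new_t
        opl.append(new_opl)
        pop.append(new_pop)
    return ovs, ovf, opl, pop


stages = [[(1, 2, 10), (2, 5, 4)], [(7, 1, 6), (8, 4, 5)]]
ovs, ovf, opl, pop = forward_pass(stages)
```

For this made-up two-stage input the pass yields three undominated 2-stage solutions with storage costs 3, 6, 9 and time costs 16, 10, 9.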

The overall optimal policies can now be traced back using a simple algorithm such as Algorithm 4.3.2. It is clear that there are NSS(INV) = NS(MR) optimal partial configurations $VPC^*(i)$, i = 1, 2, ..., NSS(INV) for VSR. The i-th such configuration is optimal for the following range of values for L:

$C_i^* = FPC \cup VPC^*(i)$ is the optimal configuration for all values of L such that:

$OVS(i, INV) \le L - MMSB - BASC < OVS(i + 1, INV)$.
Algorithm 4.3.2

    For i = NSS(INV) until 1 step -1 do;
       VPC*(i) = ∅;   /* Initialize to empty set */
       Index = i;
       For k = MR until 1 step -1 do;
          VPC*(i) = VPC*(i) U {<R_RL(k), OPL(Index, k)>};
          Index = POP(Index, k);
       end;
    end;
    □
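The backtracking step of Algorithm 4.3.2 can be sketched in Python as follows. The function name `backtrack`, the 0-based list-of-lists layout of OPL and POP, and the dictionary used for each $VPC^*(i)$ are assumptions made for this example. The configurations are produced in ascending order of i, which gives the same result as the descending loop in the text.

```python
from typing import Dict, List


def backtrack(opl: List[List[int]], pop: List[List[int]],
              rl: List[int]) -> List[Dict[int, int]]:
    """Recover VPC*(i) for every undominated solution i of the last stage.

    opl[k][i] : implementation chosen at stage k by solution i of that stage
    pop[k][i] : index of the previous-stage solution it was built from
    rl[k]     : relation index associated with stage k
    Each result maps a relation index to its implementation number.
    """
    last = len(opl) - 1
    configs: List[Dict[int, int]] = []
    for i in range(len(opl[last])):
        vpc: Dict[int, int] = {}         # initialize to the empty set
        index = i
        for k in range(last, -1, -1):    # walk from stage MR back to stage 1
            vpc[rl[k]] = opl[k][index]
            index = pop[k][index]
        configs.append(vpc)
    return configs


# Hand-built tables for a two-stage problem on relations 4 and 9.
opl = [[1, 2], [7, 7, 8]]
pop = [[-1, -1], [0, 1, 1]]
configs = backtrack(opl, pop, rl=[4, 9])
```

Here the third final solution pairs implementation 8 for relation 9 with implementation 2 for relation 4, because its POP entry points at the second 1-stage solution.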

The complexity of Algorithm 4.3.1 at stage k is proportional to $W(k) = NS(k) \cdot [NCHS(k + 1)]^2$, k = 1, 2, ..., MR - 1. This does not include the initial step of generating one-stage solutions which has a trivial time complexity proportional to NCHS(1). The total complexity is then proportional to:

$W = \sum_{k=1}^{MR-1} NS(k) \cdot [NCHS(k + 1)]^2$

The cardinality of the set of undominated solutions for stage k, NS(k), ranges from NS(k - 1) * NCHS(k) in the worst case to Max(NS(k - 1), NCHS(k)) in the best case. Usually, however, NS(k) is on the order of (NS(k - 1) + NCHS(k)). Note that even in the worst case where W is proportional to $\prod_{k=1}^{MR} NCHS(k)$, this number cannot be greater than OMSB and in reality it is much smaller.

If we assume NS(k) to be on the order of NS(k - 1) + NCHS(k), then the following observation shows that W is minimized when stages are considered in the nonincreasing order of NCHS(k), i.e. if $NCHS(k) \ge NCHS(k + 1)$, k = 1, 2, ..., MR - 1.


Observation 4.3.5

Let SP be a set of permutations:

$SP = \{P \mid P = \langle p_1, p_2, \ldots, p_N \rangle$ is a permutation of $P^* = \langle 1, 2, \ldots, N \rangle\}$.

Let $a_1, a_2, \ldots, a_N$ be real numbers such that for $1 \le i < N$, $a_i \ge a_{i+1}$. Let $\tilde{P}$ be a sequence of numbers obtained from $P \in SP$ by replacing its i-th coordinate $p_i$ with $a_{p_i}$: $\tilde{P} = \langle a_{p_1}, a_{p_2}, \ldots, a_{p_N} \rangle$. If

$W(\tilde{P}) = \sum_{i=1}^{N-1} (a_{p_1} + a_{p_2} + \cdots + a_{p_i}) \cdot a_{p_{i+1}}^2$,

then $W(\tilde{P}^*) \le W(\tilde{P})$ for all $P \in SP$.

Proof: We first show that for any permutation P if two adjacent elements are interchanged in the wrong (correct) direction, then $W(\tilde{P})$ increases (decreases). More specifically we show that if

$P = \langle p_1, \ldots, p_{i-1}, p_i, p_{i+1}, p_{i+2}, \ldots, p_N \rangle$ and

$P' = \langle p_1, \ldots, p_{i-1}, p_{i+1}, p_i, p_{i+2}, \ldots, p_N \rangle$ and

$a_{p_i} \ge a_{p_{i+1}}$, then $W(\tilde{P}') \ge W(\tilde{P})$. In order to show this, we expand $W(\tilde{P})$ and $W(\tilde{P}')$ and then form the difference $W(\tilde{P}') - W(\tilde{P})$ as follows:

$W(\tilde{P}) = a_{p_1} a_{p_2}^2 + (a_{p_1} + a_{p_2}) a_{p_3}^2 + \cdots + (a_{p_1} + \cdots + a_{p_{i-1}}) a_{p_i}^2 + (a_{p_1} + \cdots + a_{p_{i-1}} + a_{p_i}) a_{p_{i+1}}^2 + \cdots + (a_{p_1} + \cdots + a_{p_{N-1}}) a_{p_N}^2$

$W(\tilde{P}') = a_{p_1} a_{p_2}^2 + (a_{p_1} + a_{p_2}) a_{p_3}^2 + \cdots + (a_{p_1} + \cdots + a_{p_{i-1}}) a_{p_{i+1}}^2 + (a_{p_1} + \cdots + a_{p_{i-1}} + a_{p_{i+1}}) a_{p_i}^2 + \cdots + (a_{p_1} + \cdots + a_{p_{N-1}}) a_{p_N}^2$

$W(\tilde{P}') - W(\tilde{P}) = a_{p_{i+1}} a_{p_i}^2 - a_{p_i} a_{p_{i+1}}^2 = a_{p_i} a_{p_{i+1}} (a_{p_i} - a_{p_{i+1}})$

This establishes the first part of the proof in both directions, namely that i) if $a_{p_i} > a_{p_{i+1}}$ (wrong direction), then $W(\tilde{P}') > W(\tilde{P})$; ii) if $a_{p_i} < a_{p_{i+1}}$ (correct direction), then $W(\tilde{P}') < W(\tilde{P})$; and iii) if $a_{p_i} = a_{p_{i+1}}$, then $W(\tilde{P}') = W(\tilde{P})$.

We now have to show that any permutation $P \in SP$ is reachable from $P^*$ by a series of interchanges of adjacent elements in the wrong direction. Or equivalently from any permutation P in SP, we can arrive at $P^*$ by a series of interchanges of adjacent elements in the correct direction. The latter phrasing of this claim is constructively proved by the application of the "Bubble Sort" algorithm to sort elements of $\tilde{P}$ in the nonincreasing order of $a_{p_i}$. It is clear that i) all interchanges are between two adjacent elements and in the correct direction, and ii) that the final sorted sequence is $P^* = \langle 1, 2, \ldots, N \rangle$, or reachable from it with a series of interchanges that do not change $W(\tilde{P})$. Note that $W(\tilde{P}') = W(\tilde{P})$ if $a_{p_{i+1}} = a_{p_i}$. □

The amount of work saved by considering stages in the nonincreasing order of NCHS(k) is nominal if NCHS(k), k = 1, ..., MR are all on the same order of magnitude. It can, however, save a reasonable amount of work in cases such as the one illustrated by the following example:

    W(<20, 5, 2>) = 600   (min)
    W(<20, 2, 5>) = 630
    W(<5, 20, 2>) = 2100
    W(<2, 5, 20>) = 2850  (max)
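The work function of Observation 4.3.5 is easy to evaluate directly, which gives a quick check of the figures in the example above. The sketch below is illustrative only; the function name `work` is hypothetical and the sequence passed in plays the role of $\langle a_{p_1}, \ldots, a_{p_N} \rangle$.

```python
from itertools import permutations
from typing import Sequence


def work(c: Sequence[int]) -> int:
    """W = sum over i of (c_1 + ... + c_i) * c_{i+1}^2 for a stage ordering c."""
    total = 0
    prefix = 0                      # running sum c_1 + ... + c_i
    for i in range(len(c) - 1):
        prefix += c[i]
        total += prefix * c[i + 1] ** 2
    return total


# Evaluate every ordering of the three stage sizes used in the example.
orderings = {p: work(p) for p in permutations((20, 5, 2))}
best = min(orderings, key=orderings.get)
worst = max(orderings, key=orderings.get)
```

Enumerating all six orderings this way confirms that the nonincreasing ordering <20, 5, 2> is the cheapest and the nondecreasing ordering <2, 5, 20> the most expensive for this example.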

We emphasize that the above property regarding the ordering of stages was shown for the case where NS(k + 1) = NS(k) + CONSTANT*NCHS(k + 1), which is an average type of behavior expected for these kinds of problems. It is possible to prove this property for more complex types of relationships between NS(k) and NCHS(k), k = 1, 2, ..., MR, including the worst case situation where NS(k + 1) = NS(k)*NCHS(k + 1). The difference $W(\tilde{P}') - W(\tilde{P})$ in this case is

$C_1 C_2 \cdots C_{i-1} [C_i C_{i+1} - (C_i + C_{i+1})] (C_i - C_{i+1})$.

Note that the sign of $(W(\tilde{P}') - W(\tilde{P}))$ is the same as $(C_i - C_{i+1})$ because $C_k \ge 2$ for all k = 1, 2, ..., MR.

The optimization program produces the solution to the optimization problem stated in Section 4.1, in two phases. Phase 1 uses Algorithm 4.2.1 of Section 4.2 to generate undominated choices for each relation in the set of data base relations. The bulk of savings in processing time is achieved by phase 1. In the case of our application example, after deleting the dominated choices for each relation and discarding those relations for which one choice dominates all others, we are left with only 6 relations (and hence 6 stages in the second phase) with the following number of choices NCHS(k) at each stage k = 1, 2, ..., 6. (Note: MR = 6)

    k   NCHS(k)      k   NCHS(k)
    1      3         4      3
    2      2         5      2
    3      4         6      3

Table 4.3.1 Maximal number of undominated choices for stage k


Note that for the above ordering of stages: $W(\langle 3, 2, 4, 3, 2, 3 \rangle) = 347$, which is about 15% greater than $W(\tilde{P}^*) = 301$ for the best ordering of stages $P^* = \langle 4, 3, 3, 3, 2, 2 \rangle$. For the worst ordering of stages $P' = \langle 2, 2, 3, 3, 3, 4 \rangle$ the value of $W(\tilde{P}')$ is equal to 405, which is in turn about 34.5% greater than $W(\tilde{P}^*) = 301$.

In phase 2, the optimization program uses Algorithm 4.3.1 to find the maximal set of undominated solutions plus necessary information to derive the optimal partial configurations $VPC^*(i)$, i = 1, 2, ..., NS(MR), using Algorithm 4.3.2. At the end of this phase, the following information is available:

i) NS(MR) = NSS(INV)

ii) $C_i^*$, i = 1, 2, ..., NS(MR), the i-th optimal configuration

iii) $CS(C_i^*) = BASC + OVS(i, INV)$, the storage cost of $C_i^*$

iv) $CT(C_i^*) = BATC + OVF(i, INV)$, the time cost of $C_i^*$

where $C_i^*$ is the optimal configuration for all storage limits L such that $CS(C_i^*) \le L - MMSB < CS(C_{i+1}^*)$. The total storage requirement of $C_i^*$ is $MMSB + CS(C_i^*)$, i = 1, 2, ..., NS(MR). $C_1^*$ and $C_{NS(MR)}^*$, denoted $C_m$ and $C_M$ earlier in this section, are the optimal configurations with the smallest and largest storage requirements, respectively. For $L \ge MMSB + CS(C_M)$, the optimal configuration is $C_M$ and for $L < L_0 = MMSB + CS(C_m)$, no optimal configuration exists.


4.4  Data  Base  Design 

In  this  section  we  discuss  the  method  of  data  base 
design  given  a configuration  C for  the  set  of  data  base 
relations  SR.  More  specifically,  we  want  to  derive  the 
data  base  design  in  terms  of  record  and  set  declarations. 

We  recall  that  a configuration  C for  SR  is  a set  of  rela- 
tion implementation  pairs,  <R^  , MPL^  >,  j = 1,  2,  ...  , NR, 
consisting  of  exactly  one  element  for  each  relation  in  SR. 
Each  MPL j , j = 1,  2,  ...  , NR  is  a number  from  1 to  24 
designating  an  implementation  number.  As  defined  in  Chap- 
ter 3,  implementations  1,  2,  3,  and  4 are  duplications  and 
implementations  5,  6,  7,  and  8 are  aggregations.  Imple- 
mentations 9 to  20  are  associations.  When  used  to  imple- 
ment a relation  such  as  R(A,  B)  where  A = ORG(R)  and 
B = DST(R),  odd  (even)  numbered  implementations  duplicate, 
aggregate  or  associate  (whichever  is  appropriate),  B(A) 
under  A(B) . Implementation  21  for  R(A,  B)  is  the  dummy 
record  association  of  A and  B and  the  last  three  imple- 
mentations (22,  23  and  24)  for  R(A,  B)  are  called  single 
linkage  of  B under  A,  single  linkage  of  A under  B and 
double  linkage  of  A and  B,  respectively. 
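
The numbering convention above can be summarized in a small lookup. The following is only an illustrative sketch: the function name `classify_implementation` and the returned labels are hypothetical and do not come from the report; only the numeric ranges and the odd/even rule are taken from the paragraph above.

```python
def classify_implementation(mpl: int) -> tuple[str, str]:
    """Return (kind, direction) for an implementation number of R(A, B).

    A = ORG(R) and B = DST(R). The labels are illustrative names only.
    """
    if not 1 <= mpl <= 24:
        raise ValueError("implementation numbers run from 1 to 24")
    if mpl <= 20:
        if mpl <= 4:
            kind = "duplication"
        elif mpl <= 8:
            kind = "aggregation"
        else:
            kind = "association"
        # Odd numbers place B under A, even numbers place A under B.
        direction = "B under A" if mpl % 2 == 1 else "A under B"
        return kind, direction
    special = {
        21: ("dummy record association", "A and B"),
        22: ("single linkage", "B under A"),
        23: ("single linkage", "A under B"),
        24: ("double linkage", "A and B"),
    }
    return special[mpl]


if __name__ == "__main__":
    for number in (3, 7, 12, 21, 24):
        print(number, classify_implementation(number))
```
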

The  design  of  the  data  base  takes  place  in  three  steps. 
In  the  first  two  steps  intra-record  relationships  (dupli- 
cations and  aggregations)  are  considered  and  thereby  all 
record  types  and  their  components  are  determined.  In  the 
third step, inter-record relationships (associations and linkages) are considered to determine data base set declarations.

In step 1, we start by identifying all data items $A$ in the set of data items SI that are not aggregated under some other data item type $B \in SI$. In other words, given a configuration $C$ for SR such as $C = \{\langle R_1, MPL_1 \rangle, \ldots, \langle R_{NR}, MPL_{NR} \rangle\}$, we are looking for all data items $I \in SI$ such that for each $j = 1, 2, \ldots, NR$:

1) if $DST(R_j) = I$, then $\langle R_j, 5 \rangle \notin C$ and $\langle R_j, 7 \rangle \notin C$

2) if $ORG(R_j) = I$, then $\langle R_j, 6 \rangle \notin C$ and $\langle R_j, 8 \rangle \notin C$


One  record  type  will  be  defined  in  the  data  base  definition 
for  each  such  data  item  type  and  the  data  item  type  will  be 
called  the  "principal  item"  of  the  record  type.  Number  of 
occurrences  of  the  record  type  is  equal  to  the  cardinality 
of  that  data  item  type.  The  name  of  the  record  type  will 
be  formed  by  concatenating  the  string  "REC"  to  the  right 
of  the  first  3 characters  of  the  name  of  the  data  item. 

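For example consider configuration $C'$, given below, whose data base design is given in Figure 5.8 in Chapter 5.

$$C' = \{\langle R_1, 5 \rangle, \langle R_2, 5 \rangle, \langle R_4, 5 \rangle, \langle R_5, 5 \rangle, \langle R_6, 7 \rangle, \langle R_7, 5 \rangle, \langle R_8, 5 \rangle, \langle R_9, 5 \rangle,$$
$$\langle R_{10}, 9 \rangle, \langle R_{11}, 3 \rangle, \langle R_{12}, 12 \rangle, \langle R_{13}, 5 \rangle, \langle R_{14}, 5 \rangle, \langle R_{15}, 5 \rangle, \langle R_{16}, 5 \rangle, \langle R_{17}, 5 \rangle,$$
$$\langle R_{18}, 5 \rangle, \langle R_{20}, 5 \rangle, \langle R_{21}, \text{[illegible]} \rangle, \langle R_{22}, 9 \rangle, \langle R_{23}, 21 \rangle, \langle R_{24}, 5 \rangle, \langle R_{25}, 5 \rangle, \langle R_{26}, 13 \rangle\}$$

Name, origin data item type and destination data item type of each relation $R_1, \ldots, R_{26}$ are given in Chapter 2. We note that the data item called EMPNUM participates in relations [illegible] to [illegible] as origin and in $R_{10}$, $R_{22}$ and $R_{23}$ as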
destination  data  item  type.  Given  the  set  of  relation 
implementations  in  C' , we  observe  that  EMPNUM  is  not 
aggregated  under  any  other  data  item  type  and,  therefore, 
a record  type  with  unique  EMPNUM  values  in  its  occurrences 
will  be  defined  and  will  be  called  EMPREC  (see  Figure  5.8). 
Similarly,  data  item  types  called  ORGCOD,  JOBCOD,  OFRCOD 
and  COUNUM  are  not  aggregated  under  any  other  data  item 
type,  therefore,  four  more  record  types  will  be  defined 
and  called  ORGREC,  JOBREC,  OFRREC  and  COUREC,  respectively. 
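
Step 1 can be sketched in a few lines of code. This is a minimal illustration, not the report's program: the function names are hypothetical, and the four relations below are a toy subset whose origin and destination items are assumed from the surrounding discussion (the full definitions are in Chapter 2, which is not reproduced here).

```python
def principal_items(relations, config):
    """Return the data items that are not aggregated under another item.

    relations: name -> (origin item, destination item)
    config:    name -> implementation number (1..24)
    """
    items = {item for pair in relations.values() for item in pair}
    aggregated = set()
    for name, mpl in config.items():
        org, dst = relations[name]
        if mpl in (5, 7):      # odd aggregations: DST aggregated under ORG
            aggregated.add(dst)
        elif mpl in (6, 8):    # even aggregations: ORG aggregated under DST
            aggregated.add(org)
    return sorted(items - aggregated)


def record_name(item):
    """First three characters of the item name followed by "REC"."""
    return item[:3] + "REC"


# Toy subset; origin/destination assignments are assumptions for illustration.
relations = {
    "NAMOFEMP": ("EMPNUM", "EMPNAM"),
    "SALOFJOB": ("JOBCOD", "ATHSAL"),
    "MAXOFSAL": ("ATHSAL", "ATHMAX"),
    "JOBOFEMP": ("EMPNUM", "JOBCOD"),
}
config = {"NAMOFEMP": 5, "SALOFJOB": 7, "MAXOFSAL": 5, "JOBOFEMP": 12}

if __name__ == "__main__":
    for item in principal_items(relations, config):
        print(item, "->", record_name(item))
```

With this toy input only EMPNUM and JOBCOD survive as principal items, giving the record names EMPREC and JOBREC used in the text.
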

In  the  second  step  of  data  base  definition,  we  de- 
fine one  component  vector  or  repeating  group,  at  level  one 
of  record  definition,  for  each  aggregation  or  duplication 
of  some  other  data  item  type  with  the  principal  item  of 
the  record.  A component  vector  will  be  defined  for  each 
duplication  of  some  other  data  item  under  the  principal 
item  and  each  aggregation  of  some  other  data  item  if  no 
other  data  item  is  aggregated  or  duplicated  under  the 
latter.  In  the  case  where  another  data  item  is  aggregated 
or  duplicated  under  the  second  data  item  (the  one  aggre- 
gated under  the  principal  item) , then  a repeating  group  is 
defined  and  the  second  data  item  type  becomes  its  principal 
item.  Similarly,  for  any  data  item  type  duplicated  or 
aggregated  under  the  principal  item  of  a repeating  group, 
a vector or a repeating group at one level lower (higher level number) than the original repeating group will be defined. This allows arbitrarily deep levels of nested repeating group definitions. In the case of configuration $C'$ for SR, the vector called JOBCOD in EMPREC represents a duplication of data item JOBCOD under EMPNUM to implement relation $R_{11}$ called JHSOFEMP (job history of employee) using implementation 3. The counter data item JHSCNT contains the length of the vector. As an example of a repeating group in the same configuration, we observe that ATHSAL is (variably) aggregated under JOBCOD to implement SALOFJOB; see pair $\langle R_6, 7 \rangle \in C'$. At the same time ATHMAX, ATHMIN and DDUCTN are aggregated under ATHSAL to implement MAXOFSAL, MINOFSAL and DDCOFSAL, respectively; see $\langle R_7, 5 \rangle$, $\langle R_8, 5 \rangle$ and $\langle R_9, 5 \rangle$ in $C'$. Therefore, a variably dimensioned repeating group called SALRPG is defined in JOBREC whose dimension is given by the counter data item SALRGC. Data item type ATHSAL is the principal item of the repeating group, and ATHMAX, ATHMIN and DDUCTN are its components; DDUCTN is a vector of a fixed length (see Figure 5.8). This completes the formation of record types.

In step three of data base definition, we consider all association and linkage implementations to derive all inter-record relationships in the data base. First, for each $\langle R_j, MPL_j \rangle \in C$ such that $MPL_j$ is an odd (even) number from 9 to 20, we define a data base set type. The name of the data base set type is the same as the name of $R_j$. Its owner record type is the record type for which $ORG(R_j)$ ($DST(R_j)$) is the principal item. The member record type of the set type is the record type for which $DST(R_j)$ ($ORG(R_j)$) is the principal item. The following should be specified as OPTIONS in the definition of the set type:

    CHAIN                                  if MPL_j = 9 or 10
    CHAIN WITH OWNER POINTERS              if MPL_j = 11 or 12
    CHAIN WITH PRIOR POINTERS              if MPL_j = 13 or 14
    CHAIN WITH OWNER AND PRIOR POINTERS    if MPL_j = 15 or 16
    POINTER ARRAY                          if MPL_j = 17 or 18
    POINTER ARRAY WITH OWNER POINTERS      if MPL_j = 19 or 20.

For example, in the case of configuration $C'$ given above, since $\langle R_{12}, 12 \rangle \in C'$, a data base set type called JOBOFEMP is defined. The owner and member record types of the set type are JOBREC and EMPREC, respectively, and CHAIN WITH OWNER POINTERS has been specified. Recall that
$R_{12}$, $ORG(R_{12})$ and $DST(R_{12})$ are called JOBOFEMP, EMPNUM and JOBCOD, respectively. Similarly, there are four other data base set types to be defined for $R_3$, $R_{10}$, $R_{22}$ and $R_{26}$ (see Figure 5.8).
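
The rule for association implementations (9 to 20) can be sketched as follows. This is a hedged illustration only: the function `set_declaration`, the dictionary layout and the helper `record_name` are hypothetical names, and the JOBOFEMP call assumes origin EMPNUM and destination JOBCOD, which is what the stated owner (JOBREC) and member (EMPREC) imply under the even-number rule. The JOBOFORG call mirrors the set description of Figure 5.1 in Chapter 5.

```python
SET_OPTIONS = [
    "CHAIN",                                # 9, 10
    "CHAIN WITH OWNER POINTERS",            # 11, 12
    "CHAIN WITH PRIOR POINTERS",            # 13, 14
    "CHAIN WITH OWNER AND PRIOR POINTERS",  # 15, 16
    "POINTER ARRAY",                        # 17, 18
    "POINTER ARRAY WITH OWNER POINTERS",    # 19, 20
]


def record_name(item):
    """Record type whose principal item is `item` (naming rule of step 1)."""
    return item[:3] + "REC"


def set_declaration(name, org, dst, mpl):
    """Derive a set declaration for relation `name` with ORG=org, DST=dst."""
    if not 9 <= mpl <= 20:
        raise ValueError("only implementations 9..20 define a single set type")
    # Odd: owner is the ORG record; even: owner is the DST record.
    owner_item, member_item = (org, dst) if mpl % 2 == 1 else (dst, org)
    return {
        "name": name,
        "owner": record_name(owner_item),
        "member": record_name(member_item),
        "options": SET_OPTIONS[(mpl - 9) // 2],
    }


if __name__ == "__main__":
    print(set_declaration("JOBOFORG", "ORGCOD", "JOBCOD", 9))
    print(set_declaration("JOBOFEMP", "EMPNUM", "JOBCOD", 12))
```
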

Next we define one (dummy) record type and two data base set types for each $\langle R_j, 21 \rangle \in C$. The dummy record type is the common member record type of both set types, and $ORG(R_j)$ is the owner record type of one set type while $DST(R_j)$ is the owner of the other set type. The OPTION specified for both set types is always CHAIN WITH OWNER POINTERS. An example of this is $\langle R_{23}, 21 \rangle \in C'$, where the dummy record type STULNK (student link) is defined (Figure 5.8) as the 6-th record type and two data base set types STUDNTOF and STUDENTS are defined as the 6-th and 7-th data base set types in the data base definition.

Finally, for each $\langle R_j, MPL_j \rangle \in C$ such that $MPL_j$ = 22, 23 or 24, a fixed length vector of data-base-keys (pointers) is defined in the record type for which $ORG(R_j)$ is the principal item if $MPL_j = 22$, in the record type for which $DST(R_j)$ is the principal item if $MPL_j = 23$, and in both if $MPL_j = 24$.

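This completes the three steps of data base (definition) design from a given configuration. Several data base designs are given in Chapter 5 where each one is optimal, under certain conditions, for the application example of Chapter 2. In the following, we give the configurations corresponding to data base designs of Figures 5.1 to 5.6 of Chapter 5. In order to simplify the presentation, we will first give the common subset of all the six configurations as FPC. Then $C_1, \ldots, C_6$, the configurations corresponding to Figures 5.1 to 5.6, respectively, are given by $FPC \cup VPC_1, \ldots, FPC \cup VPC_6$, where $VPC_1, \ldots, VPC_6$ and FPC are as follows:

$$FPC = \{\langle R_1, 5 \rangle, \langle R_2, 5 \rangle, \langle R_4, 5 \rangle, \langle R_5, 5 \rangle, \langle R_6, 7 \rangle, \langle R_7, 5 \rangle, \langle R_8, 5 \rangle, \langle R_9, 5 \rangle, \langle R_{11}, 3 \rangle, \langle R_{13}, 5 \rangle, \langle R_{14}, 5 \rangle,$$
$$\langle R_{17}, 5 \rangle, \langle R_{18}, \text{[illegible]} \rangle, \langle R_{24}, \text{[illegible]} \rangle, \langle R_{25}, 5 \rangle\},$$

$$VPC_1 = \{\langle R_3, 9 \rangle, \langle R_{10}, 9 \rangle, \langle R_{12}, 4 \rangle, \langle R_{22}, 3 \rangle, \langle R_{23}, 3 \rangle, \langle R_{26}, 2 \rangle\},$$

$$VPC_2 = \{\langle R_3, 9 \rangle, \langle R_{10}, 9 \rangle, \langle R_{12}, \text{[illegible]} \rangle, \langle R_{22}, 3 \rangle, \langle R_{23}, 4 \rangle, \langle R_{26}, \text{[illegible]} \rangle\},$$

$$VPC_3 = \{\langle R_3, 9 \rangle, \langle R_{10}, 9 \rangle, \langle R_{12}, 10 \rangle, \langle R_{22}, 9 \rangle, \langle R_{23}, 3 \rangle, \langle R_{26}, 9 \rangle\},$$

$$VPC_4 = \{\langle R_3, 9 \rangle, \langle R_{10}, 9 \rangle, \langle R_{12}, 12 \rangle, \langle R_{22}, 9 \rangle, \langle R_{23}, 3 \rangle, \langle R_{26}, 13 \rangle\},$$

$$VPC_5 = \{\langle R_3, 9 \rangle, \langle R_{10}, 9 \rangle, \langle R_{12}, 4 \rangle, \langle R_{22}, 3 \rangle, \langle R_{23}, 3 \rangle, \langle R_{26}, 13 \rangle\},$$

$$VPC_6 = \{\langle R_3, 3 \rangle, \langle R_{10}, 3 \rangle, \langle R_{12}, 4 \rangle, \langle R_{22}, \text{[illegible]} \rangle, \langle R_{23}, \text{[illegible]} \rangle, \langle R_{26}, 2 \rangle\}.$$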




CHAPTER  V 


RESULTS 


5.1  Sensitivity  of  Time  Cost  Functions 

We  developed  the  time  cost  functions  for  relation 
implementations  in  Chapter  3.  Time  costs  of  relation 
implementations  are  evaluated,  using  these  functions,  and 
the resulting values are used as the respective second cost
components  of  the  relation  implementations.  The  first  cost 
component  associated  with  implementing  a relation  by  a 
specific  implementation  is  the  storage  cost  of  the  relation 
implementation.  Formulas  for  computing  storage  costs  were 
also  discussed  in  Chapter  3.  In  Chapter  4 we  used  this  two- 
component  cost  measure  to  design  a data  base  for  a given 
application . 

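Time costs, measured in the expected number of page faults, were expressed as functions of four page fault probabilities $p_1$, $p_2$, $p_2'$ and $p_3$, defined in Section 3.4. We repeat the expressions for these probabilities, for convenience, regarding relation $R_j(A, B) \in SR$, $j = 1, \ldots, NR$.

$$p_1 = 1 - \frac{PBSZ}{TMSB}$$

$$p_2(R_j) = \begin{cases} 1 - \dfrac{PBSZ}{\beta_j \cdot ARL} & \text{if } PBSZ < \beta_j \cdot ARL \\ 0 & \text{otherwise} \end{cases}$$

$$p_2'(R_j) = \begin{cases} 1 - \dfrac{PBSZ}{\alpha_j \cdot ARL} & \text{if } PBSZ < \alpha_j \cdot ARL \\ 0 & \text{otherwise} \end{cases}$$

$$p_3(R_j) = \begin{cases} 1 - \dfrac{PBSZ}{(\alpha_j + \beta_j) \cdot ARL} & \text{if } PBSZ < (\alpha_j + \beta_j) \cdot ARL \\ 0 & \text{otherwise} \end{cases}$$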



where, in the above, PBSZ is the page buffer size, given by

$$PBSZ = MSPG \cdot PGSZ,$$

ARL is the Average Record Length, TMSB is the Maximum Storage Bound, and

$$\alpha_j = |ORG(R_j)|, \qquad \beta_j = |DST(R_j)|.$$

The above expressions for $p_2$, $p_2'$ and $p_3$ hold for all implementations except for implementation 21, for which the exact expressions were given in Section 3.4. For example, in the expression for $p_2(R_j)$ in the case of implementation 21, $\beta_j$ should be replaced by $r_j$, which is the cardinality of $R_j$, and ARL should be replaced by DRSZ, which is the size of a dummy record.
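
The piecewise form of these probabilities is easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the single helper `file_probability` stands in for $p_2$, $p_2'$ and $p_3$ by taking the relevant cardinality ($\beta_j$, $\alpha_j$ or $\alpha_j + \beta_j$) as an argument. Implementation 21, which uses $r_j$ and DRSZ instead, is not handled separately.

```python
def buffer_size(mspg, pgsz):
    """Page buffer size PBSZ = MSPG * PGSZ (in characters)."""
    return mspg * pgsz


def p1(pbsz, tmsb):
    """Page fault probability relative to the whole data base."""
    return 1.0 - pbsz / tmsb


def file_probability(pbsz, cardinality, arl):
    """1 - PBSZ / (cardinality * ARL) if the buffer is smaller than the
    file of `cardinality` records of average length `arl`, else 0."""
    file_size = cardinality * arl
    if pbsz < file_size:
        return 1.0 - pbsz / file_size
    return 0.0


if __name__ == "__main__":
    pbsz = buffer_size(2, 2000)                  # 4,000 characters
    print(p1(pbsz, 1_600_000))                   # buffer vs. storage bound
    print(file_probability(pbsz, 1128, 211.5))   # a large file
    print(file_probability(pbsz, 10, 211.5))     # a file that fits in the buffer
```

The last call shows the "otherwise" branch: a file of 10 records of 211.5 characters fits entirely in a 4,000-character buffer, so the probability is 0.
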

As can be seen, in most cases $p_2$ and $p_3$ depend on the average record length ARL, a quantity which is not exactly known at the beginning of the optimization process. Time costs of relation implementations, defined in terms of these probabilities, are computed using an estimated ARL to derive the "optimal" configuration. We recall from Chapters 3 and 4 that a configuration $C$ for a set of data base relations $SR = \{R_1, \ldots, R_j, \ldots, R_{NR}\}$ is a set of ordered pairs of relation implementations:

$$C = \{\langle R_1, MPL_1 \rangle, \ldots, \langle R_j, MPL_j \rangle, \ldots, \langle R_{NR}, MPL_{NR} \rangle\}$$

where $MPL_j$, $j = 1, \ldots, NR$, is an implementation number (an integer from 1 to 24) designating one of the 24 implementation alternatives defined in Chapter 3. An acceptable configuration $C$ for SR is defined as follows:

$$C = \{\langle R_j, MPL_j \rangle \mid j = 1, \ldots, NR \text{ and } AC(MPL_j, j) = 1 \text{ for all } j\}.$$

The (cumulative) storage cost of an acceptable configuration $C$ is the sum of storage costs of individual relation implementations in the configuration and is given by

$$CS(C) = \sum_{j=1}^{NR} SC(MPL_j, j).$$

The (cumulative) time cost of an acceptable configuration $C$ is similarly defined as follows:

$$CT(C) = \sum_{j=1}^{NR} TC(MPL_j, j).$$

The "optimal" configuration $C^*$ for a set SR of data base relations is an acceptable configuration such that

$$CT(C^*) \le CT(C) \quad \text{and} \quad CS(C^*) \le TMSB$$

for all acceptable configurations $C$ for which $CS(C) \le TMSB$.
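
The definition of $C^*$ can be made concrete with an exhaustive search over a tiny example. This is deliberately not the efficient algorithm of Chapter 4; it is a brute-force sketch that simply applies the definition. The function name, the data layout and all cost figures below are made up for illustration.

```python
from itertools import product


def optimal_configuration(sc, tc, acceptable, tmsb):
    """Brute-force search for the minimum-time configuration within TMSB.

    sc[j][m], tc[j][m]: storage and time cost of implementation m for
                        relation j (hypothetical values).
    acceptable[j]:      implementation numbers with AC(m, j) = 1.
    Returns (CT, CS, choice) or None when no configuration fits.
    """
    best = None
    for choice in product(*(sorted(a) for a in acceptable)):
        cs = sum(sc[j][m] for j, m in enumerate(choice))
        if cs > tmsb:
            continue  # violates the storage bound
        ct = sum(tc[j][m] for j, m in enumerate(choice))
        if best is None or ct < best[0]:
            best = (ct, cs, choice)
    return best


# Two relations, two acceptable implementations each (invented costs).
sc = [{5: 100, 9: 140}, {3: 200, 9: 120}]
tc = [{5: 10, 9: 30}, {3: 5, 9: 25}]
acceptable = [{5, 9}, {3, 9}]

if __name__ == "__main__":
    for bound in (100, 250, 300):
        print(bound, optimal_configuration(sc, tc, acceptable, bound))
```

The search space grows as the product of the numbers of acceptable implementations per relation, which is why an exhaustive approach like this is only useful as a check on very small inputs.
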

In Chapter 4 we developed an efficient algorithm that finds $C^*$. The fact that an estimated ARL value was used to evaluate time cost functions has the following two implications.

(1) Time cost values are approximate values because the ARL of the resulting configuration was not used to derive its own time cost.

(2) Let $CT(C_i, \ell(C_j))$ denote the time cost of configuration $C_i$ evaluated using the average record length of configuration $C_j$. When a configuration $C_1$ is selected, as optimal, by the optimization algorithm of Chapter 4, using $ARL = \ell_0$, another configuration $C_2$ may exist for which $CT(C_2, \ell(C_2)) < CT(C_1, \ell(C_1))$ and $CT(C_2, \ell_0) > CT(C_1, \ell_0)$.

If case (2) occurs, configuration $C_2$ will be missed (will not be selected) by the optimization process because it looked inferior to $C_1$ when time costs were evaluated using $ARL = \ell_0$. In the following we show that the percentage variation of time costs, when ARL is varied, is small enough so that the chosen configuration ($C_1$) is within reasonable proximity of the optimal one ($C_2$) if the latter was missed.

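The time cost of an implementation MPL for a relation $R$ in a typical operation $O$ is of the following form.

$$TC(MPL, R, O) = \mu_1 p_1 + \mu_2 p_2 + \mu_3 p_3 + \mu_4 p_4$$

In this expression $p_2$ and $p_3$ are dependent on the average record length ARL. We define the percentage variation of time costs, TC, for a variation $\Delta\ell$ in average record length ARL as follows.

$$\delta_\ell = \frac{\Delta TC}{TC}$$

Using the above expression for TC we get:

$$\delta_\ell = \frac{\Delta TC}{TC} = \frac{\sum_{i=1}^{4} \mu_i \, \Delta p_i}{\sum_{i=1}^{4} \mu_i \, p_i}, \qquad \Delta p_i = \text{variation of } p_i \text{ for a } \Delta\ell \text{ variation in } \ell, \; i = 1, \ldots, 4.$$

And since $p_1$ and $p_4$ are independent of ARL, $\Delta p_1 = \Delta p_4 = 0$ and we have the following.

$$\delta_\ell \le \frac{\mu_2 \, \Delta p_2 + \mu_3 \, \Delta p_3}{\mu_2 \, p_2 + \mu_3 \, p_3} \qquad (5.1.1)$$

In the above expression $1 - \dfrac{k}{\gamma_i \ell}$ should be substituted for $p_i$, $i = 2, 3$, where $k$ is the page buffer size PBSZ given by $k = PBSZ = MSPG \cdot PGSZ$, and $\ell$ is the average record length. For a specific relation $R_j(A, B)$, $\gamma_2$ and $\gamma_3$ are given by the following (see Chapter 3, Section 4).

$$\gamma_2 = \beta_j \text{ for } p_2(R_j), \qquad \gamma_2 = \alpha_j \text{ for } p_2'(R_j), \qquad \gamma_3 = \alpha_j + \beta_j \text{ for } p_3(R_j). \qquad (5.1.2)$$

The right hand side of (5.1.1) is less than or equal to $\Delta p_2 / p_2$ if $\Delta p_3 / p_3 \le \Delta p_2 / p_2$. It is clear from (5.1.2) that $\Delta p_3 / p_3$ is less than $\Delta p_2 / p_2$ and hence the following is true:

$$\delta_\ell = \frac{\Delta TC}{TC} \le \frac{\mu_2 \, \Delta p_2 + \mu_3 \, \Delta p_3}{\mu_2 \, p_2 + \mu_3 \, p_3} \le \frac{\Delta p_2}{p_2}$$

Substituting $1 - \dfrac{k}{\gamma_2 \ell}$ for $p_2$ and $\dfrac{k}{\gamma_2 \ell} \cdot \dfrac{\Delta\ell}{\ell}$ for $\Delta p_2$ we conclude that $\delta_\ell \le \dfrac{k/\gamma_2}{\ell - k/\gamma_2} \cdot \dfrac{\Delta\ell}{\ell}$. Using the average cardinality $\gamma$ of all data items in SI for $\gamma_2$ we arrive at the following definition of the error estimate $\delta_\ell$.

$$\delta_\ell = \frac{k/\gamma}{\ell - k/\gamma} \cdot \frac{\Delta\ell}{\ell}$$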

In a similar manner we estimate the error introduced in time cost evaluations by using TMSB in the computation of $p_1$ instead of the true storage requirement, $S$, of the optimal data base. The formula for $p_1$, discussed in Section 3.4, is given below.

$$p_1 = 1 - \frac{PBSZ}{TMSB}$$

The percentage variation of time cost for a variation $\Delta S$ in storage requirement $S$ is defined as follows.

$$\delta_s = \frac{\Delta TC}{TC} = \frac{\Delta p_1}{p_1}$$

Substituting $1 - \dfrac{k}{S}$ for $p_1$, where $k = PBSZ = MSPG \cdot PGSZ$ and $S$ is the storage requirement for the data base, we get the following.

$$\delta_s = \frac{k/S}{1 - k/S} \cdot \frac{\Delta S}{S}$$

In order to estimate errors in the worst case we use the highest/lowest possible value for $\Delta\ell/\ell$ in the computation of $\delta_\ell$ and, similarly, the highest/lowest possible value of $\Delta S/S$ in the computation of $\delta_s$.
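
The two error estimates can be evaluated directly from their definitions. The sketch below is illustrative: the function names are hypothetical, and the choice of inputs is an assumption, namely $\ell = \ell_{\min} = 211.5$, $\Delta\ell = 49.1$, $\gamma = 1128$, $S = 1{,}207{,}050$ and $\Delta S = TMSB - S$ with $TMSB = 1{,}600{,}000$, all of which are figures reported later in this section. Under that assumption the results for a buffer of 20,000 characters come out close to the 2.12% and .55% entries of Table 5.1.

```python
def delta_l(k, gamma, ell, d_ell):
    """Error estimate for the record-length approximation (as a fraction)."""
    ratio = k / gamma
    return ratio / (ell - ratio) * (d_ell / ell)


def delta_s(k, s, d_s):
    """Error estimate for using TMSB instead of the true storage S."""
    ratio = k / s
    return ratio / (1.0 - ratio) * (d_s / s)


if __name__ == "__main__":
    k = 5 * 4000                  # MSPG = 5, PGSZ = 4000
    gamma = 1128                  # average cardinality of data items
    ell, d_ell = 211.5, 49.1      # assumed worst case: smallest l, largest delta
    s = 1_207_050                 # assumed: smallest storage requirement
    d_s = 1_600_000 - s           # assumed: distance to TMSB
    print(f"delta_l = {100 * delta_l(k, gamma, ell, d_ell):.2f}%")
    print(f"delta_s = {100 * delta_s(k, s, d_s):.2f}%")
```
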

The optimization program of Chapter 4 was run for the example application introduced in Chapter 2 for various <MSPG, PGSZ> pairs. For each such pair the value of ARL was varied over a range of 50 to 600 characters. Values of $\delta_\ell$ and $\delta_s$ were computed for each run and they are tabulated in Tables 5.1 to 5.8.

For the above runs the values of other parameters of the model were set as follows: size of an address occurrence (pointer size) PTRS = 4 characters, size of a counter occurrence CNTRS = 6 characters, size of a dummy record DRSZ = 4*PTRS = 16 characters and TMSB = 1,600,000 characters.

As can be seen in the tables, eight different pairs <MSPG, PGSZ> of Main Storage Pages/Page Sizes were considered, where MSPG = 1, 2, 4 and 5 pages and PGSZ = 2000 and 4000 characters. For each pair, the program was run for some ARL values such as 50, 100, 220, 400 and 600 characters. The average record lengths of all configurations obtained in each of those cases (for all possible storage limits) were in the range from $\ell_{\min}$ to $\ell_{\max}$ characters. The values of $\ell_{\min}$, $\ell_{\max}$ and $\Delta\ell = \ell_{\max} - \ell_{\min}$ are given in the Tables.

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.4    211.5    48.9    2.12     .55
    100    260.6    211.5    48.9    2.12     .55
    200    260.6    211.5    49.1    2.12     .55
    400    260.6    211.5    49.1    2.12     .55
    600    260.6    211.5    49.1    2.12     .55

Table 5.1  Time cost error estimates for MSPG = 5, PGSZ = 4000

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.4    211.5    48.9    1.66     .44
    100    260.4    211.5    48.9    1.66     .44
    220    260.6    211.5    49.1    1.67     .44
    600    260.6    211.5    49.1    1.67     .44

Table 5.2  Time cost error estimates for MSPG = 4, PGSZ = 4000

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.4    211.5    48.9    .8       .33
    100    260.6    211.5    49.1    .8       .33
    220    260.6    211.5    49.1    .8       .33
    600    260.6    211.5    49.1    .8       .33

Table 5.3  Time cost error estimates for MSPG = 2, PGSZ = 4000

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.6    211.5    49.1    .4       .11
    220    260.6    211.5    49.1    .4       .11
    600    260.6    211.5    49.1    .4       .11

Table 5.4  Time cost error estimates for MSPG = 1, PGSZ = 4000

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.5    211.5    49.0    1.02     .27
    100    260.6    211.5    49.1    1.02     .27
    220    260.6    211.5    49.1    1.02     .27
    400    260.6    211.5    49.1    1.02     .27
    600    260.6    211.5    49.1    1.02     .27

Table 5.5  Time cost error estimates for MSPG = 5, PGSZ = 2000

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.5    211.5    49.0    .8       .33
    100    260.6    211.5    49.1    .8       .33
    220    260.6    211.5    49.1    .8       .33
    400    260.6    211.5    49.1    .8       .33
    600    260.6    211.5    49.1    .8       .33

Table 5.6  Time cost error estimates for MSPG = 4, PGSZ = 2000

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.6    211.5    49.1    .4       .11
    100    260.6    211.5    49.1    .4       .11
    220    260.6    211.5    49.1    .4       .11
    400    260.6    211.5    49.1    .4       .11
    600    260.6    211.5    49.1    .4       .11

Table 5.7  Time cost error estimates for MSPG = 2, PGSZ = 2000

    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     260.6    211.5    49.1    .13      .054
    100    260.6    211.5    49.1    .13      .054
    220    260.6    211.5    49.1    .13      .054
    400    260.6    211.5    49.1    .13      .054
    600    260.6    211.5    49.1    .13      .054

Table 5.8  Time cost error estimates for MSPG = 1, PGSZ = 2000

For the application example of Chapter 2 with 24 data items, 26 relations and 4 run units, the average cardinality of data items is $\gamma = 27{,}070/24 \approx 1128$, and this value is always used for $\gamma$. As can be seen from Tables 5.1 to 5.8, $\delta_\ell$ is always greater than $\delta_s$. This is because the size PBSZ of the page buffer is a larger percentage of the size of an average file than of the total size of the data base. $\delta_\ell$ varies from .13% for PBSZ = 2,000 characters (Table 5.8) to 2.12% for PBSZ = 20,000 characters (Table 5.1), while $\delta_s$ varies from .054% for PBSZ = 2,000 characters to .55% for PBSZ = 20,000 characters. The average record lengths of the resulting configurations vary between 211.5 and 260.6 characters, which accounts for a $\Delta\ell$ of about 49 characters.

Very large values of PBSZ result in high error estimates, especially in $\delta_\ell$. However, the value of $\Delta\ell$ does not change very much. For example, for an extreme case of MSPG = 25 pages with PGSZ = 4,000 characters (PBSZ = 100,000 characters), $\delta_\ell$ can be as high as 26.6% while $\Delta\ell$ is between 48.9 and 49.7 characters (Table 5.9).


    ARL    l_max    l_min    Δl      δ_l %    δ_s %
    50     230.0    180.4    49.7    26.6     2.94
    220    260.4    211.5    48.9    16.69    2.94
    600    260.4    211.5    48.9    16.69    2.94

Table 5.9  Time Cost Error Estimates for MSPG = 25, PGSZ = 4,000

Results given in Tables 5.1 to 5.8 also show that the values of $\ell_{\max}$ and $\ell_{\min}$ have very small variations with respect to changes in ARL and PBSZ. The value of $\ell_{\max}$ in the above results is between 260.4 and 260.6 characters and $\ell_{\min}$ is 211.5 characters. The optimization program was run for all <MSPG, PGSZ> pairs, such that MSPG = 1, 2, 4, 5 and PGSZ = 2,000, 4,000, with ARL set to 211.5 and 260.6 characters for each pair. The results of these 16 runs indicate that the average record lengths of optimal configurations (for different storage limits) are still between 211.5 and 260.6 characters. The cumulative storage costs of the optimal configurations are between 1,207,050 and 1,485,350 characters. Storage limits set at values greater than 1,485,350 do not result in any new configurations, i.e. the configuration corresponding to a limit of 1,485,350 characters is optimal for any storage limit greater than or equal to 1,485,350 characters. For storage limit values less than 1,207,050 characters no optimal solution exists because there is not enough storage available to implement the data base for that particular application.


It is evident, from the definition of $\delta_\ell$ and $\delta_s$, that these error estimates present the percentage of possible errors made in the evaluation of the time cost of relation implementations. The preceding results show that for reasonable PBSZ values (2,000 to 20,000 characters) these errors are on the order of a few percentage points. It is important, however, to observe that at the same time the cumulative time costs $CT(C_M)$ and $CT(C_m)$ of configurations $C_M$ and $C_m$ with highest and lowest time costs, respectively, are relatively very far apart. For example, when MSPG = 1, PGSZ = 2,000 and ARL = 260.6, $CT_M$ and $CT_m$ are equal to 135,673 and 52,553, respectively (in expected page fault rate), for a difference of about 88.3% computed from

$$100 \, (CT_M - CT_m) / ((CT_M + CT_m)/2).$$

Given the results stated above, concerning the approximations in time cost values, we know that the errors introduced as a result of using an estimated ARL value are reasonably small. Further, we observed that the average record lengths of the resulting data base designs stay within an almost fixed boundary. In our example the range $(\ell_{\min}, \ell_{\max})$ was (211.5, 260.6), which changed to (211.5, 260.5) or (211.5, 260.4) in some cases. We have also demonstrated that adopting an estimate for ARL within this boundary results in an error of only a few percent in time costs.

The overall design method we use, then, is (1) to execute the optimization program for a wide range of ARL values, (2) to compute the interval $(\ell_{\min}, \ell_{\max})$ in each case and (3) to adopt an estimate for ARL in the intersection of all these intervals. For the data base designs discussed in the next section we use an average record length estimate of 211.5 characters.
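
Step (3) of this method amounts to intersecting closed intervals. A minimal sketch follows; the function name is hypothetical, and the three intervals are the $(\ell_{\min}, \ell_{\max})$ ranges quoted above for the example application.

```python
def intersect(intervals):
    """Intersection of closed intervals (lo, hi); None if it is empty."""
    lo = max(a for a, _ in intervals)
    hi = min(b for _, b in intervals)
    return (lo, hi) if lo <= hi else None


# Ranges observed for the example application in this section.
observed = [(211.5, 260.6), (211.5, 260.5), (211.5, 260.4)]

if __name__ == "__main__":
    print(intersect(observed))
```

Any ARL estimate inside the resulting interval satisfies step (3); the value 211.5 adopted in the text is its lower end point.
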

5.2  Discussion  of  Resulting  Data  Base  Designs 

In  this  section  we  shall  discuss  several  alternative 
data  base  designs  for  the  application  example  introduced  in 
Chapter  2.  Each  alternative  is  optimal  under  specific 
values  for  parameters  of  the  model. 

For  example,  in  the  case  of  the  following  values  of 
parameters,  the  data  bases  shown  in  Figures  5.1  to  5.6  are 
obtained. 

    PTRS   (size of a pointer)         4 characters
    CNTRS  (size of a counter)         6 characters
    PGSZ   (size of a page)            2,000 characters
    PLOKS  (size of a privacy lock)    20 characters
    MSPG   (main storage pages)        2 pages
    ARL    (Average Record Length)     211.5 characters
    TMSB   (Maximum Storage Bound)     1,600,000 characters

Figure 5.1 shows the data base description which is
optimal  for  the  above  values  of  parameters  and  a storage 
limit  of  1.35  million  characters.  The  actual  storage 
requirement  of  that  data  base  is  1,348,350  characters  and 
the  expected  page  fault  rate  for  the  data  base  is  54,694. 

Record Descriptions:

    NO.  NAME    LEVEL  ITEM    SIZE  COMMENT
    1    ORGREC                       /*Organization Record      */
                 1      ORGCOD  10    /*Organization Code        */
                 1      ORGNAM  30    /*Name of Organization     */
                 1      BUDGET  6     /*Budget of Organization   */
    2    JOBREC                       /*Job Record               */
                 1      JOBCOD  10    /*Job Code                 */
                 1      JOBTTL  60    /*Job Title                */
                 1      ATHQNT  6     /*Authorized Quantity      */
                 1      SALRGC  6     /*Salary R.G. Count        */
                 1      SALRPG        /*Salary Repeating Group   */
                                      OCCURS SALRGC TIMES
                 2      ATHSAL  6     /*Authorized Salary        */
                 2      ATHMAX  6     /*Authorized Maximum       */
                 2      ATHMIN  6     /*Authorized Minimum       */
                 2      DDUCTN  15    /*Deduction                */
                                      OCCURS 2 TIMES
                 1      EMPCNT  6
                 1      EMPNUM  25    /*Employees on the Job     */
                                      OCCURS EMPCNT TIMES
    3    EMPREC                       /*Employee Record          */
                 1      EMPNUM  25    /*Employee Number          */
                 1      EMPNAM  30    /*Employee Name            */
                 1      EMPADR  100   /*Employee Address         */
                 1      EMPBRT  8     /*Date of Birth            */
                 1      EMPMST  2     /*Marital Status           */
                 1      EMPLVL  2     /*Seniority Level          */
                 1      JHSYRS  2     /*Years with organization  */
                 1      JHSCNT  6     /*Job History Count        */
                 1      JOBCOD  10    /*Jobs Held with Org'n.    */
                                      OCCURS JHSCNT TIMES
    4    OFRREC                       /*Offering Record          */
                 1      OFRCOD  25    /*Offering Code            */
                 1      OFRFMT  3     /*Format                   */
                 1      OFRDAT  8     /*Date                     */
                 1      OFRLOC  10    /*Location                 */
                 1      TCHCNT  6     /*Count of Teachers        */
                 1      TCHNUM  25    /*Teachers                 */
                                      OCCURS TCHCNT TIMES
                 1      COUNUM  15    /*Course Number            */
                 1      STUCNT  6     /*Count of Students        */
                 1      STUNUM  25    /*Students                 */
                                      OCCURS STUCNT TIMES
    5    COUREC                       /*Course Record            */
                 1      COUNUM  15    /*Course Number            */
                 1      COUTTL  60    /*Course Title             */
                 1      COUDSP  250   /*Course Description       */

Set Descriptions:

    NO.  NAME      OWNER   MEMBER  OPTIONS
    1    JOBOFORG  ORGREC  JOBREC  CHAIN
    2    EMPOFORG  ORGREC  EMPREC  CHAIN

Figure 5.1  Data Base Description for a Storage Limit of 1.35 M Bytes, PGSZ = 2000 Bytes and MSPG = 2

The data base shown in Figure 5.1 contains five record types called ORGREC, JOBREC, EMPREC, OFRREC and COUREC.

The  first  record  type,  ORGREC,  is  the  "Organization  Record" 
type  that  contains  data  item  types  ORGCOD,  ORGNAM  and  BUDGET; 
thereby implementing data base relations NAMOFORG and BGTOFORG, defined as the first and second data base relations $R_1$ and $R_2$ in SR, respectively. The second record type, JOBREC,
is  a record  type  describing  jobs.  It  contains  data  items 
JOBCOD,  JOBTTL,  ATHQNT,  EMPCNT  and  EMPNUM,  defined  at 
level  1 (of  record  description)  and  a repeating  group  called 
SALRPG  (salary  repeating  group)  whose  components  ATHSAL, 
ATHMAX,  ATHMIN  and  DDUCTN  are  defined  at  level  2.  EMPNUM 
is  a vector  of  variable  length  whose  length  is  given  by 
EMPCNT  and  for  each  JOBREC  record  EMPNUM  contains  employee 
numbers  of  employees  holding  that  job.  The  presence  of 
this  vector  in  the  JOBREC  record  types  constitutes  the 
implementation  of  the  relation  called  JOBOFEMP  by  implemen- 
tation 4 (variable  duplication  of  EMPNUM  under  JOBCOD) . 

The occurrences of EMPNUM values in these records are duplicate copies in contrast to the shared occurrences in instances of the EMPREC (employee record) record type. The
latter  record  type  consists  of  data  item  types  EMPNUM, 

EMPNAM,  EMPADR,  EMPBRT,  EMPMST,  EMPLVL,  JHSYRS  and  JHSCNT 
and  a vector  of  job  codes  called  JOBCOD  of  variable  length 
given  by  the  value  of  JHSCNT.  This  vector  records  a list 
of  codes  of  jobs  the  employee  has  held  with  the  organization. 
The juxtaposition of the above data items to form the EMPREC record type implements the following data base relations: NAMOFEMP, ADROFEMP, BRTOFEMP, MSTOFEMP, LVLOFEMP, YRSOFEMP and JHSOFEMP.

The fourth record type is called OFRREC (offerings record) and it consists of data item types OFRCOD, OFRFMT,
OFRDAT,  OFRLOC,  COUNUM,  TCHCNT  and  STUCNT  and  vectors 
TCHNUM  and  STUNUM  of  lengths  given  by  the  contents  of 
TCHCNT  and  STUCNT,  respectively.  The  first  four  data  items 
contain  the  code,  format,  date  and  location  of  the  offering, 
respectively.  The  fifth  data  item  COUNUM  gives  the  course 
number  of  the  offering  and  the  two  vectors  TCHNUM  and 
STUNUM  contain  employee  numbers  of  teachers  and  students 
of  the  offering,  respectively. 
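
The count-prefixed vectors TCHNUM and STUNUM make OFRREC a variable-length record. As a minimal sketch (not part of the original report), the data length of one OFRREC occurrence in this layout can be estimated from the item sizes. The sizes below are the ones listed for these items in the record descriptions of Figures 5.2 to 5.6; the function name `ofrrec_length` is hypothetical, and pointer or directory overhead is ignored.

```python
# Item sizes in characters, as listed in the record descriptions
# (Figures 5.2-5.6). Pointer and directory overhead is ignored here.
FIXED_ITEMS = {
    "OFRCOD": 25,  # offering code
    "OFRFMT": 3,   # format
    "OFRDAT": 8,   # date
    "OFRLOC": 10,  # location
    "COUNUM": 15,  # course number of the offering
    "TCHCNT": 6,   # counter for the TCHNUM vector
    "STUCNT": 6,   # counter for the STUNUM vector
}
EMPNUM_SIZE = 25   # each TCHNUM / STUNUM element is an employee number


def ofrrec_length(teachers: int, students: int) -> int:
    """Data length of one OFRREC occurrence with the given vector lengths."""
    if teachers < 0 or students < 0:
        raise ValueError("vector lengths must be non-negative")
    fixed = sum(FIXED_ITEMS.values())
    return fixed + EMPNUM_SIZE * (teachers + students)
```

The fixed part is the same for every occurrence; only the two vectors grow with the number of teachers and students of the offering.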

The  last  record  type  COUREC  (course  record)  contains 
data  items  COUNUM,  COUTTL  and  COUDSP  containing  the  number, 
title  and  description  of  the  course  respectively. 

There are two data base set types defined in Figure
5.1, called JOBOFORG and EMPOFORG. The owner of both set
types is the ORGREC record type, and the members of the two
sets are JOBREC and EMPREC, respectively. Sets of both types will
be implemented by chains with next pointers (without owner
or prior pointers). Each ORGREC occurrence is the owner of
a data base set (of type EMPOFORG) whose member records are
records (of type EMPREC) of employees working for the
organization. Each ORGREC is also the owner of a data base
set (of type JOBOFORG) whose member records are records
(of type JOBREC) of jobs within the organization. Data base
set types EMPOFORG and JOBOFORG represent the implementation
of relations EMPOFORG and JOBOFORG, respectively, by
implementation 9 (chain with next pointers association of EMPNUM
and JOBCOD under ORGCOD, respectively).
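
The chain with next pointers can be pictured with a small sketch. This is an illustration only, not the report's implementation: the class names `Owner` and `Member` are hypothetical, the keys are made-up values, and the chain is assumed to end with an empty pointer (the report does not say here whether the last member points back to the owner).

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Member:
    """A member record (e.g. an EMPREC) carrying one next pointer."""
    key: str
    next: Optional["Member"] = None


@dataclass
class Owner:
    """An owner record (e.g. an ORGREC) pointing at its first member."""
    key: str
    first: Optional[Member] = None

    def insert(self, member: Member) -> None:
        # Link the new member at the head of the chain.
        member.next = self.first
        self.first = member

    def members(self) -> Iterator[Member]:
        # Follow next pointers; there are no owner or prior pointers,
        # so the chain can only be walked forward from the owner.
        node = self.first
        while node is not None:
            yield node
            node = node.next


org = Owner("ORG01")
for empnum in ("E001", "E002", "E003"):
    org.insert(Member(empnum))
chain = [m.key for m in org.members()]
```

Each member costs one pointer of storage, and reaching a given member means walking the chain from the owner.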

This particular design for the data base requires
1,348,350 characters of storage, and a small reduction in the
storage limit results in another design becoming optimal,
namely the one presented in Figure 5.5. The difference
between the two designs is that in the second one (Figure
5.5) the relation called OFROFCOU is implemented by
implementation 13, which is the chain with next and prior pointers
association of OFRCOD under COUNUM; this accounts for the
set type OFROFCOU defined in Figure 5.5. In the
previous design (Figure 5.1), by contrast, the relation OFROFCOU is
implemented by implementation 2 (fixed duplication of COUNUM
under OFRCOD), which accounts for the presence of the data
item COUNUM in the definition of the OFRREC record type (Figure
5.1). The configuration associated with the data base
description of Figure 5.5 has a cumulative storage cost of
1,348,250 characters and a cumulative time cost of 55,570
in expected page fault rate.

Larger  reductions  in  the  storage  space  limit  result  in 
different  data  base  designs.  For  example,  for  a storage 
limit  of  1.3  M characters  the  structure  of  Figure  5.2 
becomes  optimal,  although  at  a higher  time  cost  of  58,186 
in  expected  page  fault  rate.  The  data  base  description  of 
Figure  5.2  is  different  from  that  of  Figure  5.1  in  the 
following  ways.  Relation  JOBOFEMP  was  implemented  by 




Record Descriptions:

NO.  NAME    LEVEL  ITEM    SIZE  COMMENT
1    ORGREC                       /*Organization Record*/
             1      ORGCOD  10    /*Organization Code*/
             1      ORGNAM  30    /*Name of Organization*/
             1      BUDGET  6     /*Budget of Organization*/
2    JOBREC                       /*Job Record*/
             1      JOBCOD  10    /*Job Code*/
             1      JOBTTL  60    /*Job Title*/
             1      ATHQNT  6     /*Authorized Quantity*/
             1      SALRGC  6     /*Salary R.G. Count*/
             1      SALRPG        /*Salary Repeating Group*/
                    OCCURS SALRGC TIMES
             2      ATHSAL  6     /*Authorized Salary*/
             2      ATHMAX  6     /*Authorized Maximum*/
             2      ATHMIN  6     /*Authorized Minimum*/
             2      DDUCTN  15    /*Deductions*/
                    OCCURS 2 TIMES
3    EMPREC                       /*Employee Record*/
             1      EMPNUM  25    /*Employee Number*/
             1      EMPNAM  30    /*Name of Employee*/
             1      EMPADR  100   /*Address of Employee*/
             1      EMPBRT  8     /*Date of Birth*/
             1      EMPMST  2     /*Marital Status*/
             1      EMPLVL  2     /*Seniority Level*/
             1      JHSYRS  2     /*Years with Organization*/
             1      JHSCNT  6     /*Job History Count*/
             1      JOBCOD  10    /*Jobs Held with Org'n.*/
                    OCCURS JHSCNT TIMES
             1      JOBCOD  10    /*Job of Employee*/
             1      OFRCNT  6     /*Offering Count*/
             1      OFRCOD  25    /*Student in*/
                    OCCURS OFRCNT TIMES
4    OFRREC                       /*Offering Record*/
             1      OFRCOD  25    /*Offering Code*/
             1      OFRFMT  3     /*Format*/
             1      OFRDAT  8     /*Date*/
             1      OFRLOC  10    /*Location*/
             1      TCHCNT  6     /*Count of Teachers*/
             1      TCHNUM  25    /*Teachers*/
                    OCCURS TCHCNT TIMES
             1      COUNUM  15    /*Course Number*/
5    COUREC                       /*Course Record*/
             1      COUNUM  15    /*Course Number*/
             1      COUTTL  60    /*Course Title*/
             1      COUDSP  250   /*Course Description*/

Set Descriptions:

NO.  NAME      OWNER   MEMBER  OPTIONS
1    JOBOFORG  ORGREC  JOBREC  CHAIN
2    EMPOFORG  ORGREC  EMPREC  CHAIN

Figure 5.2  Data Base Description for a Storage Limit of
1.3 M Bytes, PGSZ = 2000 Bytes and MSPG = 2




Record Descriptions:

NO.  NAME    LEVEL  ITEM    SIZE  COMMENT
1    ORGREC                       /*Organization Record*/
             1      ORGCOD  10    /*Organization Code*/
             1      ORGNAM  30    /*Name of Organization*/
             1      BUDGET  6     /*Budget of Organization*/
2    JOBREC                       /*Job Record*/
             1      JOBCOD  10    /*Job Code*/
             1      JOBTTL  60    /*Job Title*/
             1      ATHQNT  6     /*Authorized Quantity*/
             1      SALRGC  6     /*Salary R.G. Count*/
             1      SALRPG        /*Salary Repeating Group*/
                    OCCURS SALRGC TIMES
             2      ATHSAL  6     /*Authorized Salary*/
             2      ATHMAX  6     /*Authorized Maximum*/
             2      ATHMIN  6     /*Authorized Minimum*/
             2      DDUCTN  15    /*Deductions*/
                    OCCURS 2 TIMES
3    EMPREC                       /*Employee Record*/
             1      EMPNUM  25    /*Employee Number*/
             1      EMPNAM  30    /*Name of Employee*/
             1      EMPADR  100   /*Address of Employee*/
             1      EMPBRT  8     /*Date of Birth*/
             1      EMPMST  2     /*Marital Status*/
             1      EMPLVL  2     /*Seniority Level*/
             1      JHSYRS  2     /*Years with Organization*/
             1      JHSCNT  6     /*Job History Count*/
             1      JOBCOD  10    /*Jobs Held with Org'n.*/
                    OCCURS JHSCNT TIMES
4    OFRREC                       /*Offering Record*/
             1      OFRCOD  25    /*Offering Code*/
             1      OFRFMT  3     /*Format*/
             1      OFRDAT  8     /*Date*/
             1      OFRLOC  10    /*Location*/
             1      STUCNT  6     /*Count of Students*/
             1      STUNUM  25    /*Employee Nos. of Stdnts*/
                    OCCURS STUCNT TIMES
5    COUREC                       /*Course Record*/
             1      COUNUM  15    /*Course Number*/
             1      COUTTL  60    /*Course Title*/
             1      COUDSP  250   /*Course Description*/

Set Descriptions:

NO.  NAME      OWNER   MEMBER  OPTIONS
1    JOBOFORG  ORGREC  JOBREC  CHAIN
2    EMPOFORG  ORGREC  EMPREC  CHAIN
3    JOBOFEMP  JOBREC  EMPREC  CHAIN
4    TEACHERS  OFRREC  EMPREC  CHAIN
5    OFROFCOU  COUREC  OFRREC  CHAIN

Figure 5.3  Data Base Description for a Storage Limit of
1,207,050 Bytes, PGSZ = 2000 Bytes and MSPG = 2




Record Descriptions:

NO.  NAME    LEVEL  ITEM    SIZE  COMMENT
1    ORGREC                       /*Organization Record*/
             1      ORGCOD  10    /*Organization Code*/
             1      ORGNAM  30    /*Name of Organization*/
             1      BUDGET  6     /*Budget of Organization*/
2    JOBREC                       /*Job Record*/
             1      JOBCOD  10    /*Job Code*/
             1      JOBTTL  60    /*Job Title*/
             1      ATHQNT  6     /*Authorized Quantity*/
             1      SALRGC  6     /*Salary R.G. Count*/
             1      SALRPG        /*Salary Repeating Group*/
                    OCCURS SALRGC TIMES
             2      ATHSAL  6     /*Authorized Salary*/
             2      ATHMAX  6     /*Authorized Maximum*/
             2      ATHMIN  6     /*Authorized Minimum*/
             2      DDUCTN  15    /*Deductions*/
                    OCCURS 2 TIMES
3    EMPREC                       /*Employee Record*/
             1      EMPNUM  25    /*Employee Number*/
             1      EMPNAM  30    /*Name of Employee*/
             1      EMPADR  100   /*Address of Employee*/
             1      EMPBRT  8     /*Date of Birth*/
             1      EMPMST  2     /*Marital Status*/
             1      EMPLVL  2     /*Seniority Level*/
             1      JHSYRS  2     /*Years with Organization*/
             1      JHSCNT  6     /*Job History Count*/
             1      JOBCOD  10    /*Jobs Held with Org'n.*/
                    OCCURS JHSCNT TIMES
4    OFRREC                       /*Offering Record*/
             1      OFRCOD  25    /*Offering Code*/
             1      OFRFMT  3     /*Format*/
             1      OFRDAT  8     /*Date*/
             1      OFRLOC  10    /*Location*/
             1      STUCNT  6     /*Count of Students*/
             1      STUNUM  25    /*Students*/
                    OCCURS STUCNT TIMES
5    COUREC                       /*Course Record*/
             1      COUNUM  15    /*Course Number*/
             1      COUTTL  60    /*Course Title*/
             1      COUDSP  250   /*Course Description*/

Set Descriptions:

NO.  NAME      OWNER   MEMBER  OPTIONS
1    JOBOFORG  ORGREC  JOBREC  CHAIN
2    EMPOFORG  ORGREC  EMPREC  CHAIN
3    JOBOFEMP  JOBREC  EMPREC  CHAIN WITH OWNER POINTERS
4    TEACHERS  OFRREC  EMPREC  CHAIN
5    OFROFCOU  COUREC  OFRREC  CHAIN WITH PRIOR POINTERS

Figure 5.4  Data Base Description for a Storage Limit of
1,227,650 Bytes, PGSZ = 2000 Bytes and MSPG = 2




Record Descriptions:

NO.  NAME    LEVEL  ITEM    SIZE  COMMENT
1    ORGREC                       /*Organization Record*/
             1      ORGCOD  10    /*Organization Code*/
             1      ORGNAM  30    /*Name of Organization*/
             1      BUDGET  6     /*Budget of Organization*/
2    JOBREC                       /*Job Record*/
             1      JOBCOD  10    /*Job Code*/
             1      JOBTTL  60    /*Job Title*/
             1      ATHQNT  6     /*Authorized Quantity*/
             1      SALRGC  6     /*Salary R.G. Count*/
             1      SALRPG        /*Salary Repeating Group*/
                    OCCURS SALRGC TIMES
             2      ATHSAL  6     /*Authorized Salary*/
             2      ATHMAX  6     /*Authorized Maximum*/
             2      ATHMIN  6     /*Authorized Minimum*/
             2      DDUCTN  15    /*Deductions*/
                    OCCURS 2 TIMES
             1      EMPCNT  6
             1      EMPNUM  25    /*Employees on the Job*/
                    OCCURS EMPCNT TIMES
3    EMPREC                       /*Employee Record*/
             1      EMPNUM  25    /*Employee Number*/
             1      EMPNAM  30    /*Employee Name*/
             1      EMPADR  100   /*Employee Address*/
             1      EMPBRT  8     /*Date of Birth*/
             1      EMPMST  2     /*Marital Status*/
             1      EMPLVL  2     /*Seniority Level*/
             1      JHSYRS  2     /*Years with Organization*/
             1      JHSCNT  6     /*Job History Count*/
             1      JOBCOD  10    /*Jobs Held with Org'n.*/
                    OCCURS JHSCNT TIMES
4    OFRREC                       /*Offering Record*/
             1      OFRCOD  25    /*Offering Code*/
             1      OFRFMT  3     /*Format*/
             1      OFRDAT  8     /*Date*/
             1      OFRLOC  10    /*Location*/
             1      TCHCNT  6     /*Count of Teachers*/
             1      TCHNUM  25    /*Teachers*/
                    OCCURS TCHCNT TIMES
             1      STUCNT  6     /*Count of Students*/
             1      STUNUM  25    /*Students*/
                    OCCURS STUCNT TIMES
5    COUREC                       /*Course Record*/
             1      COUNUM  15    /*Course Number*/
             1      COUTTL  60    /*Course Title*/
             1      COUDSP  250   /*Course Description*/

Set Descriptions:

NO.  NAME      OWNER   MEMBER  OPTIONS
1    JOBOFORG  ORGREC  JOBREC  CHAIN
2    EMPOFORG  ORGREC  EMPREC  CHAIN
3    OFROFCOU  COUREC  OFRREC  CHAIN WITH NEXT AND PRIOR POINTERS

Figure 5.5  Data Base Description for a Storage Limit of
1,348,250 Bytes, PGSZ = 2000 Bytes and MSPG = 2

implementation 4 in Figure 5.1, and therefore we had a
variable length vector of employee numbers EMPNUM (length
given by EMPCNT in JOBREC) in the JOBREC record type,
whereas in Figure 5.2 this relation is implemented by
implementation 1, and therefore the JOBREC record type does not
contain the EMPNUM vector and the EMPCNT counter; instead, a JOBCOD
data item in the EMPREC record type (that will contain the
job code of the employee) is defined to implement the JOBOFEMP
relation. We recall that a) implementation 4 for relation
JOBOFEMP is the variable duplication of EMPNUM under JOBCOD
and b) implementation 1 for JOBOFEMP is the fixed
duplication of JOBCOD under EMPNUM, where EMPNUM = ORG(JOBOFEMP)
and JOBCOD = DST(JOBOFEMP).
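
The contrast between the two implementations of JOBOFEMP can be sketched with plain dictionaries. This is only an illustration under stated assumptions: the record contents (job codes, titles, employee numbers) are invented sample values, the function names are hypothetical, and the EMPCNT counter is left implicit in the length of the list.

```python
# Implementation 4 (Figure 5.1): each JOBREC carries a variable-length
# vector of the EMPNUM values of employees holding the job.
jobrec_impl4 = {
    "J10": {"JOBTTL": "Analyst", "EMPNUM": ["E001", "E003"]},
    "J20": {"JOBTTL": "Clerk", "EMPNUM": ["E002"]},
}

# Implementation 1 (Figure 5.2): each EMPREC carries a single JOBCOD.
emprec_impl1 = {
    "E001": {"JOBCOD": "J10"},
    "E002": {"JOBCOD": "J20"},
    "E003": {"JOBCOD": "J10"},
}


def job_of_employee_impl4(empnum: str) -> str:
    """Find the job of an employee by scanning the JOBREC vectors."""
    for jobcod, rec in jobrec_impl4.items():
        if empnum in rec["EMPNUM"]:
            return jobcod
    raise KeyError(empnum)


def job_of_employee_impl1(empnum: str) -> str:
    """Find the job of an employee directly from its EMPREC."""
    return emprec_impl1[empnum]["JOBCOD"]


def employees_of_job_impl1(jobcod: str) -> list[str]:
    """The reverse direction needs a scan of EMPREC under implementation 1."""
    return sorted(e for e, rec in emprec_impl1.items() if rec["JOBCOD"] == jobcod)
```

Which direction is cheap depends on where the duplicated item is stored, which is why the user activities in the model influence the choice between the two.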

The relation called STUDENTS is also implemented
differently in the two designs. In the data base design
described in Figure 5.1 this relation is implemented by
implementation 3, the variable duplication of EMPNUM under
OFRCOD, where OFRCOD = ORG(STUDENTS) and
EMPNUM = DST(STUDENTS). Therefore in Figure 5.1 a variable
length vector of employee numbers called STUNUM is included
in the definition of OFRREC, and the length of the vector is
given by the value of a counter data item type called
STUCNT, also defined in the OFRREC record type. On the other
hand, relation STUDENTS is implemented by implementation 4,
variable duplication of OFRCOD under EMPNUM, in the data
base design described in Figure 5.2. As can be seen in
Figure 5.2, a variable length vector of offering codes
called OFRCOD is included in the definition of the EMPREC
record type; the length of the vector is given by the
contents of another data item type, called OFRCNT, also
included in the definition of the EMPREC record type. The
elements of the vector OFRCOD in a particular employee's
record (type EMPREC) contain offering codes of offerings
in which that particular employee is registered as a
student.

Further reductions in the storage space limit result
in different data base designs with higher time costs. If
the storage limit is reduced to 1,227,650 characters, the
data base design described in Figure 5.4 becomes optimal, and
it has a cumulative time cost of 76,846 in expected page
fault rate. This design includes five data base set type
definitions. It is different from the data base design of
Figure 5.1 in the following ways. The data base relation
called JOBOFEMP is implemented, in Figure 5.4, by
implementation 12 (called chain with next and owner pointers
association of EMPNUM under JOBCOD, where
EMPNUM = ORG(JOBOFEMP) and JOBCOD = DST(JOBOFEMP)). That
is why, in Figure 5.4, a data base set type called JOBOFEMP
is defined (set type number 3) whose owner and member record
types are the JOBREC and EMPREC record types, respectively, and,
as can be seen in the figure, the OWNER POINTERS option is
specified in the definition of the set type. In the
data base design of Figure 5.1, by contrast, relation JOBOFEMP was
implemented by implementation 4, and thereby employee numbers of
employees holding a job were duplicated in a variable
length vector in the JOBREC occurrence of that job. The
implementation of the relation called TEACHERS is also
different in the two data base designs. Employee numbers
of teachers of a given offering (a specific offering code)
were stored in the elements of a vector called TCHNUM in
the OFRREC record of the offering, in the data base
description of Figure 5.1, because the TEACHERS relation was
implemented by implementation 3. In the design shown in
Figure 5.4, by contrast, this relation is implemented by implementation
9 (chain with next pointers association of EMPNUM under
OFRCOD). This is the reason why a data base set type
called TEACHERS is defined in Figure 5.4. The owner and
member record types of this set type are the OFRREC and EMPREC
record types, respectively. Each occurrence of the TEACHERS
set type consists of exactly one record of OFRREC type,
corresponding to some OFRCOD value, and zero or more
records of EMPREC type, corresponding to employees who
teach that offering. Finally, the relation called OFROFCOU,
implemented by implementation 2 in the data base design of
Figure 5.1, is now implemented by implementation 13 (chain
with next and prior pointers association of OFRCOD under
COUNUM). We recall that OFRCOD = DST(OFROFCOU) and
COUNUM = ORG(OFROFCOU). In the data base design of Figure
5.1, for each record of type OFRREC (corresponding to an
offering code OFRCOD), the associated course number is
duplicated in the data item COUNUM, which is included in the
definition of OFRREC. However, in the data base design of
Figure 5.4, since OFROFCOU is implemented by implementation
13, a data base set type is defined with owner and member
record types declared to be COUREC and OFRREC, respectively
(set type number 5 in Figure 5.4).
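
The effect of the OWNER POINTERS option can be illustrated by counting how many pointers must be followed to reach the owner from a member. The sketch below is an assumption-laden illustration, not the report's cost model: it assumes the chain is a ring whose last member points back to the owner, the names `Rec`, `build_ring` and `find_owner` are hypothetical, and the pointer count is only a rough proxy for page faults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Rec:
    key: str
    is_owner: bool = False
    next: Optional["Rec"] = None    # next pointer (ring closes on the owner)
    owner: Optional["Rec"] = None   # owner pointer, only if the option is on


def build_ring(owner_key, member_keys, owner_pointers):
    """Build owner -> m1 -> ... -> mn -> owner, optionally with owner pointers."""
    owner = Rec(owner_key, is_owner=True)
    members = [Rec(k, owner=owner if owner_pointers else None) for k in member_keys]
    chain = [owner] + members
    for cur, nxt in zip(chain, chain[1:] + [owner]):
        cur.next = nxt
    return owner, members


def find_owner(member):
    """Return (owner key, number of pointers followed)."""
    if member.owner is not None:
        return member.owner.key, 1
    hops, node = 0, member
    while not node.is_owner:
        node = node.next
        hops += 1
    return node.key, hops


_, plain = build_ring("J10", ["E1", "E2", "E3", "E4"], owner_pointers=False)
_, with_owner = build_ring("J10", ["E1", "E2", "E3", "E4"], owner_pointers=True)
```

Without owner pointers the walk from the first member passes every later member before reaching the owner; with them the owner is one pointer away, at the price of one extra pointer per member record.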

If we continue reducing the storage space limit,
progressively more costly data base designs become optimal.
However, if the limit is reduced to a value below 1,207,050
characters, then there are no acceptable solutions, simply
because the sum of storage costs of the least costly (in
terms of storage) implementation for each relation
is 1,207,050. The data base design which is optimal for a
storage limit of 1,207,050 is described in Figure 5.3.
Actually, for the storage limit of 1,207,050 characters, this
is the only acceptable solution and hence it is optimal.

The data base design described in Figure 5.3 is different
from the one in Figure 5.4 (discussed above) in the set
implementation technique option for set types 3 and 5. Set types
number 3 and 5 in Figure 5.4 are defined with the OWNER
POINTERS and PRIOR POINTERS options specified, respectively,
whereas in the design of Figure 5.3 these two data base
sets (JOBOFEMP and OFROFCOU) use the simple chain with next
pointers only in the occurrences of the set types. The
data base design of Figure 5.3 has an average record length
of 211.5 characters.

Increasing the storage limit, on the other hand, also
results in new data base designs becoming optimal. For
example, increasing the limit to 1,485,350 characters
causes the data base described in Figure 5.6 to become
optimal. We note that in this design of the data base no
set types are defined. Increasing the storage space limit
beyond 1,485,350 characters, however, does not change the
optimal design, because there is no acceptable configuration
whose time cost is less than the time cost of the
configuration associated with the data base design shown in Figure
5.6.


5.3  Storage/Time  Trade-Off 

In the preceding section we observed that for a given
application (namely a given set of data items, data base
relations, user activities and storage space parameters) a
different data base design becomes optimal when the limit
on storage space is varied over a certain range of values
between $S_{\min}$ and $S_{\max}$. For example, in the case of the example
presented in Section 5.2, the range of storage limits was
from 1,207,050 to 1,485,350 characters. There are certain
storage limit values in this range where a new data base
design becomes optimal. We will call these points on the
storage space scale breakpoints. The breakpoints on the
storage space limit values correspond, one-to-one, to the
sequence of undominated solutions of the final stage of
optimization. We know from Chapter 4 that if this sequence
is ordered in increasing order of storage costs, then it is
also ordered in strictly decreasing order of time costs.




Record Descriptions:

NO.  NAME    LEVEL  ITEM    SIZE  COMMENT
1    ORGREC                       /*Organization Record*/
             1      ORGCOD  10    /*Organization Code*/
             1      ORGNAM  30    /*Name of Organization*/
             1      BUDGET  6     /*Budget of Organization*/
             1      JOBCNT  6     /*Count of Jobs*/
             1      JOBCOD  10    /*Jobs in the Org'n.*/
                    OCCURS JOBCNT TIMES
             1      EMPCNT  6     /*Count of Employees*/
             1      EMPNUM  25    /*Employees of the Org'n.*/
                    OCCURS EMPCNT TIMES
2    JOBREC                       /*Job Record*/
             1      JOBCOD  10    /*Job Code*/
             1      JOBTTL  60    /*Job Title*/
             1      ATHQNT  6     /*Authorized Quantity*/
             1      SALRGC  6     /*Salary R.G. Count*/
             1      SALRPG        /*Salary Repeating Group*/
                    OCCURS SALRGC TIMES
             2      ATHSAL  6     /*Authorized Salary*/
             2      ATHMAX  6     /*Authorized Maximum*/
             2      ATHMIN  6     /*Authorized Minimum*/
             2      DDUCTN  15    /*Deduction*/
                    OCCURS 2 TIMES
             1      EMPCNT  6
             1      EMPNUM  25    /*Employees on the Job*/
                    OCCURS EMPCNT TIMES
3    EMPREC                       /*Employee Record*/
             1      EMPNUM  25    /*Employee Number*/
             1      EMPNAM  30    /*Employee Name*/
             1      EMPADR  100   /*Employee Address*/
             1      EMPBRT  8     /*Date of Birth*/
             1      EMPMST  2     /*Marital Status*/
             1      EMPLVL  2     /*Seniority Level*/
             1      JHSYRS  2     /*Years with Organization*/
             1      JHSCNT  6     /*Job History Count*/
             1      JOBCOD  10    /*Jobs Held with Org'n.*/
                    OCCURS JHSCNT TIMES
4    OFRREC                       /*Offering Record*/
             1      OFRCOD  25    /*Offering Code*/
             1      OFRFMT  3     /*Format*/
             1      OFRDAT  8     /*Date*/
             1      OFRLOC  10    /*Location*/
             1      TCHCNT  6     /*Count of Teachers*/
             1      TCHNUM  25    /*Teachers*/
                    OCCURS TCHCNT TIMES
             1      COUNUM  15    /*Course Number*/
             1      STUCNT  6     /*Count of Students*/
             1      STUNUM  25    /*Students*/
                    OCCURS STUCNT TIMES
5    COUREC                       /*Course Record*/
             1      COUNUM  15    /*Course Number*/
             1      COUTTL  60    /*Course Title*/
             1      COUDSP  250   /*Course Description*/

Figure 5.6  Data Base Description for a Storage Limit of
1,485,350 (or more) Bytes, PGSZ = 2000 and
MSPG = 2




For each storage space limit on the range between two
consecutive breakpoints $S_k < S_{k+1}$, including $S_k$ and excluding
$S_{k+1}$, the configuration (and hence the data base design)
associated with $S_k$ is optimal, $k = 1, \ldots, NSS$, where $NSS$ is
the number of solutions to the final stage of optimization
and

$$S_{\min} = S_1 < S_2 < \cdots < S_{NSS} = S_{\max}.$$
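
The breakpoint rule amounts to a lookup in the sorted sequence of undominated solutions. The following sketch assumes a small subset of that sequence: only four of the 45 breakpoints are listed, namely those whose storage and time costs are quoted in this chapter, so results for limits between the listed values are illustrative only. The function name `optimal_design` is hypothetical, and the storage cost of the Figure 5.4 configuration is assumed to equal the 1,227,650 limit quoted for it.

```python
from bisect import bisect_right

# (storage cost in characters, time cost in expected page faults, design).
# The time costs of the first and last entries are the extremes of the
# range quoted for the trade-off graph.
BREAKPOINTS = [
    (1_207_050, 135_161, "Figure 5.3"),
    (1_227_650, 76_846, "Figure 5.4"),
    (1_348_250, 55_570, "Figure 5.5"),
    (1_485_350, 52_492, "Figure 5.6"),
]
STORAGE = [s for s, _, _ in BREAKPOINTS]


def optimal_design(limit: int):
    """Return the breakpoint S_k with S_k <= limit < S_{k+1}, or None."""
    k = bisect_right(STORAGE, limit)
    if k == 0:
        return None  # below S_min: no acceptable configuration
    return BREAKPOINTS[k - 1]
```

A limit below the first breakpoint yields no configuration, and any limit at or above the last breakpoint yields the same design.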

We can illustrate the relationship between storage and
time costs of optimal configurations in a diagram such as
the one shown in Figure 5.7. In this diagram the vertical
axis represents time costs in expected number of page faults
and the horizontal axis represents storage costs in number
of characters of storage. Each discontinuity point on the
graph corresponds to a breakpoint on the storage space
limit. There are a total of 45 breakpoints on the graph
shown in Figure 5.7, on a range of storage costs from
1,207,050 to 1,485,350 characters. Time costs vary from
135,161 to 52,492 in expected number of page faults. It
can be seen from the figure that a sharp rise occurs in
time costs when the storage limit is reduced to below 1,227,050
characters. This rise of about 56,889 units in time cost
is due mainly to the difference in time costs of
implementations 10 and 12 for relation JOBOFEMP. Implementation 10
for this relation has a time cost of 90,086 and
implementation 12 has a time cost of 31,807 units. Implementation
12, we recall, is similar to implementation 10 except for
the extra owner pointers that are present in implementation
12. This is an example of the case where a rather small
storage overhead (20,000 characters) for the owner pointers
in a specific set type can make a relatively large saving
in time costs (58,279 units).

[Figure 5.7  Storage/Time Trade-off Graph. Horizontal axis: STORAGE COST (M BYTES), ticks from 1.20 to 1.50.]
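
The figures quoted above can be checked with a few lines of arithmetic. This is a worked check rather than anything from the report's software; the variable names are hypothetical, and the pointer count assumes the 4-character pointer size (PTRS) that the text gives for Figures 5.1 to 5.6.

```python
time_impl10 = 90_086   # JOBOFEMP, chain with next pointers only
time_impl12 = 31_807   # JOBOFEMP, chain with next and owner pointers
owner_pointer_storage = 20_000  # characters of extra storage
PTRS = 4               # pointer size in characters (assumed, see lead-in)

# Time saved by adding owner pointers to the JOBOFEMP set type.
saving = time_impl10 - time_impl12

# Expected page faults saved per character of extra storage.
saving_per_character = saving / owner_pointer_storage

# Number of owner pointers that 20,000 characters would pay for.
owner_pointers = owner_pointer_storage // PTRS
```

The difference reproduces the 58,279 units stated in the text, close to three expected page faults saved per character of pointer storage.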

The storage/time trade-off graph exemplified in Figure
5.7 should be extended infinitely to the right along the
horizontal line through the point on the graph
corresponding to $S_{\max} = CS(C_M)$. This is because all storage limits
greater than $S_{\max}$ will result in the same configuration $C_M$
becoming optimal, and the (cumulative) time cost of this
configuration, $CT(C_M)$, is actually the lowest time cost
achievable under any storage limit. Similarly, since no
acceptable configuration is attainable which has a storage
cost less than $S_{\min} = CS(C_m)$, we associate a time cost of
infinity with such configurations. This is demonstrated
graphically by extending the storage/time trade-off graph
upward on the vertical line through the point on the graph
corresponding to $S_{\min}$.

The idea behind choosing a two-component cost function
was to avoid the assessment of a relative measure of
importance of time costs vs. storage costs, which may not be
known, at least at the time the data base is first designed.
If, however, such information is available (possibly at a
data base reorganization point), it can be used in
conjunction with the storage/time trade-off graph to satisfy both
objectives. In the simplest case, for example, one can
assume some dollar cost a for a unit of storage cost and b for
a unit of time cost, which amounts to a linear additive
measure of cost such as COST = aS + bT for S units of
storage cost and T units of time cost. In this case points of
equal COST lie on straight lines whose slopes depend on the
a:b ratio and are located farther from the origin for
higher COSTs. The designer can then easily find the
minimum COST point on the trade-off graph by superimposing it
with parallel lines of constant COST. The underlying
assumption here is that the storage space limit is not a
severe design constraint and the designer has some
flexibility over the adjustment of this parameter in the model.
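
Superimposing lines of constant COST on the graph is equivalent to evaluating COST = aS + bT at each breakpoint and taking the smallest value. The sketch below does this for the few breakpoints whose costs are quoted in this chapter; the full graph has 45, the weights passed in are hypothetical, and the storage cost of the Figure 5.4 configuration is assumed equal to its quoted limit.

```python
# (storage cost S in characters, time cost T in expected page faults).
# Only points quoted in the text; the full graph has 45 breakpoints.
POINTS = [
    (1_207_050, 135_161),
    (1_227_650, 76_846),
    (1_348_250, 55_570),
    (1_485_350, 52_492),
]


def min_cost_point(a: float, b: float, points=POINTS):
    """Return (cost, S, T) minimising COST = a*S + b*T over the points."""
    return min((a * s + b * t, s, t) for s, t in points)
```

With equal weights the minimum falls on the second point, because the large drop in time cost outweighs the extra storage; weighting only storage or only time selects the first or the last point, respectively.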

In  the  next  section  we  will  examine  the  effect  of 
changing  various  parameters  of  the  model  (mainly  storage 
related  parameters)  on  the  outcome  of  the  optimization 
process.  At  this  point,  however,  we  present  the  optimal 
design  of  the  data  base  for  a different  set  of  parameter 
values,  which  is  structurally  different  from  other  data 
base  designs  in  this  section. 

For  the  following  values  of  parameters  and  a storage 
space  limit  of  1,194,000  characters  the  data  base  design 
of  Figure  5.8  becomes  optimal. 

PTRS  = 3 characters;  DRSZ  = 4*PTRS  = 12  characters, 

MSPG  = 2 pages,  PGSZ  = 2,000  characters, 

CNTRS  = 6 characters 

PLOKS  = 20  characters. 

The  difference  between  these  values  of  parameters  as 
compared  to  those  for  Figures  5.1  to  5.6  is  in  the  value  of 



Record Descriptions:

NO.  NAME    LEVEL  ITEM    SIZE  COMMENT
1    ORGREC                       /*Organization Record*/
             1      ORGCOD  10    /*Organization Code*/
             1      ORGNAM  30    /*Name of Organization*/
             1      BUDGET  6     /*Budget of Organization*/
2    JOBREC                       /*Job Record*/
             1      JOBCOD  10    /*Job Code*/
             1      JOBTTL  60    /*Job Title*/
             1      ATHQNT  6     /*Authorized Quantity*/
             1      SALRGC  6     /*Salary R.G. Count*/
             1      SALRPG        /*Salary Repeating Group*/
                    OCCURS SALRGC TIMES
             2      ATHSAL  6     /*Authorized Salary*/
             2      ATHMAX  6     /*Authorized Maximum*/
             2      ATHMIN  6     /*Authorized Minimum*/
             2      DDUCTN  15    /*Deductions*/
                    OCCURS 2 TIMES
3    EMPREC                       /*Employee Record*/
             1      EMPNUM  25    /*Employee Number*/
             1      EMPNAM  30    /*Name of Employee*/
             1      EMPADR  100   /*Address of Employee*/
             1      EMPBRT  8     /*Date of Birth*/
             1      EMPMST  2     /*Marital Status*/
             1      EMPLVL  2     /*Seniority Level*/
             1      JHSYRS  2     /*Years with Organization*/
             1      JHSCNT  6     /*Job History Count*/
             1      JOBCOD  10    /*Jobs Held with Org'n.*/
                    OCCURS JHSCNT TIMES
4    OFRREC                       /*Offering Record*/
             1      OFRCOD  25    /*Offering Code*/
             1      OFRFMT  3     /*Format*/
             1      OFRDAT  8     /*Date*/
             1      OFRLOC  10    /*Location*/
5    COUREC                       /*Course Record*/
             1      COUNUM  15    /*Course Number*/
             1      COUTTL  60    /*Course Title*/
             1      COUDSP  250   /*Course Description*/
6    STULNK                       /*Student Link Record; This is a Dummy*/
                                  /*Record and Contains no Data*/

Set Descriptions:

NO.  NAME      OWNER   MEMBER  OPTIONS
1    JOBOFORG  ORGREC  JOBREC  CHAIN
2    EMPOFORG  ORGREC  EMPREC  CHAIN
3    JOBOFEMP  JOBREC  EMPREC  CHAIN WITH OWNER POINTERS
4    TEACHERS  OFRREC  EMPREC  CHAIN
5    OFROFCOU  COUREC  OFRREC  CHAIN WITH PRIOR POINTERS
6    STUDNTOF  EMPREC  STULNK  CHAIN WITH OWNER POINTERS
7    STUDENTS  OFRREC  STULNK  CHAIN WITH OWNER POINTERS

Figure 5.8  Data Base Description for a Storage Limit of
1,194,000 Bytes, PGSZ = 2000 Bytes and MSPG = 2




the pointer size PTRS, which is reduced from 4 to 3
characters. This reduction in the size of an address occurrence
makes implementation 21 less costly in terms of storage,
because dummy records of this implementation are totally
made up of address occurrences. We recall that
implementation 21 for a relation such as R(A, B), defined in 3.1.11,
is called the Dummy Record Association of A and B, and that
this implementation can be used to implement R(A, B) even
if the relation is a many-to-many relation. One example of
such a relation is the 23rd relation defined in the
application described in Chapter 2, and it is called STUDENTS.
The origin of the relation is OFRCOD and the destination of
the relation is EMPNUM, and it relates employee numbers of
employees registered as students to corresponding offerings.
The relation is generally a many-to-many relation because a
student may be registered in many offerings while at the
same time many students are registered for a given offering.
The data base design described in Figure 5.8 is to some
extent similar to the one shown in Figure 5.4, and therefore
we only mention their differences.

The source of the difference between the two data base
designs is the implementation of the STUDENTS relation. In
the design described in Figure 5.4 this relation is implemented
by variable duplication of EMPNUM under OFRCOD
(implementation 3), whereas in the data base design of Figure
5.8 the STUDENTS relation is implemented by the dummy record
association of EMPNUM and OFRCOD (implementation 21). This
necessitates, in the design shown in Figure 5.8, the definition
of a new record type called STULNK (student link record). No
data item types are defined as components of this record type,
and occurrences of this record type are composed only of
pointers that establish their membership in certain data base
sets. There are also two new data base set types defined in
Figure 5.8, called STUDNTOF and STUDENTS. The STULNK record
type is declared as the member record type in both set types,
and the EMPREC and OFRREC record types are declared as the
owner record types of the STUDNTOF and STUDENTS set types,
respectively. Sets of these two types are to be implemented by
chaining STULNK member records to each other and to their
respective owner records through next and owner pointers. Each
employee record (type: EMPREC) is the owner of a data base set
of STUDNTOF type (possibly empty, if the employee is not
registered as a student in any offering). Member records of
this set are dummy records of STULNK type, and each one of
these member records also participates in exactly one data base
set of STUDENTS type whose owner, in turn, is a record of type
OFRREC. In this fashion each STULNK record corresponds to
exactly one ordered pair <a, b> in the relation STUDENTS, where
a and b are OFRCOD and EMPNUM values. This record is a member
of exactly one set of STUDNTOF type whose owner contains b, and
it is also a member of exactly one set of STUDENTS type whose
owner record contains a. The storage/time trade-off graph for
MSPG = 2, PGSZ = 2,000 characters and PTRS = 3 characters is
given in Figure 5.9.
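
As an illustration only, the dummy record association described
above can be sketched in Python. The class and function names
(Owner, StuLnk, connect, register, members) are assumptions
introduced for this sketch and do not come from the DBTG
specification or from the design program itself; Python object
references stand in for the stored next and owner pointers.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Owner:
    """An owner record (an EMPREC or OFRREC occurrence)."""
    key: str
    # set type name -> first STULNK record on that chain (absent if empty)
    first: dict = field(default_factory=dict)


@dataclass(eq=False)
class StuLnk:
    """Dummy link record: it holds pointers only, no data items."""
    owner: dict = field(default_factory=dict)  # set type -> owner record
    next: dict = field(default_factory=dict)   # set type -> next link or None


def connect(link, owner, set_type):
    """Insert link at the head of owner's chain for the given set type."""
    link.owner[set_type] = owner
    link.next[set_type] = owner.first.get(set_type)
    owner.first[set_type] = link


def register(emp, ofr):
    """Record one pair <OFRCOD, EMPNUM> of STUDENTS as a single link."""
    link = StuLnk()
    connect(link, emp, "STUDNTOF")  # owner is the employee record
    connect(link, ofr, "STUDENTS")  # owner is the offering record
    return link


def members(owner, set_type):
    """Walk the next pointers of one set occurrence."""
    link = owner.first.get(set_type)
    while link is not None:
        yield link
        link = link.next[set_type]


def students_of(ofr):
    """EMPNUM keys reached from an offering through owner pointers."""
    return sorted(m.owner["STUDNTOF"].key for m in members(ofr, "STUDENTS"))


def offerings_of(emp):
    """OFRCOD keys reached from an employee through owner pointers."""
    return sorted(m.owner["STUDENTS"].key for m in members(emp, "STUDNTOF"))
```

In this sketch every link carries two next and two owner
pointers and nothing else, which is why the storage cost of such
records would scale directly with PTRS.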

5.4 Effects of Some Parameter Variations

In this section we examine the effect of changing some of
the parameters, especially those of the storage space model,
on the resulting designs and particularly on the storage/time
trade-off graphs discussed in the previous section.

The effect of changing MSPG, the number of pages of data
that may be in main storage at the same time, while keeping
all other parameters fixed is shown in Figure 5.10. This
figure shows the storage/time graphs for PGSZ = 2,000
characters, PTRS = 4 characters, CNTRS = 6 characters and
MSPG = 1, 2, 4 and 5 pages. A smaller portion of the four
graphs has been redrawn with different scales (the scale of
the vertical axis increased to 400%, and that of the horizontal
axis reduced to 80%, of the original scales). This graph
shows that for fixed PGSZ an increase in the number of main
storage pages, MSPG, results in an overall reduction of time
costs. The reduction is, however, not significant when MSPG
is increased from 1 to 2 or from 4 to 5. The ranges of time
costs in the four graphs for MSPG = 1, 2, 4 and 5 are 52,553
to 135,602, 52,492 to 135,161, 52,354 to 134,269 and 52,300 to
133,833, respectively.
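
To make the size of these reductions concrete, the short
Python sketch below computes the percentage change at both ends
of the time cost range for each successive MSPG value. The
numbers are those quoted above; the function names are chosen
here for illustration.

```python
# (lowest, highest) time cost on each graph of Figure 5.10, keyed by MSPG
time_cost_range = {
    1: (52_553, 135_602),
    2: (52_492, 135_161),
    4: (52_354, 134_269),
    5: (52_300, 133_833),
}


def percent_reduction(before, after):
    """Relative decrease from before to after, in percent."""
    return 100.0 * (before - after) / before


def reductions(ranges):
    """Map (MSPG_from, MSPG_to) to the % reduction at the low and high ends."""
    keys = sorted(ranges)
    result = {}
    for a, b in zip(keys, keys[1:]):
        (low_a, high_a), (low_b, high_b) = ranges[a], ranges[b]
        result[(a, b)] = (
            percent_reduction(low_a, low_b),
            percent_reduction(high_a, high_b),
        )
    return result


for step, (low, high) in reductions(time_cost_range).items():
    print(step, f"low end {low:.2f}%", f"high end {high:.2f}%")
```

Every step changes either end of the range by less than one
percent.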


The observation that, for a fixed storage limit, the time
cost of the optimal configuration decreases when MSPG is
increased is not surprising, because higher MSPG values
account for larger page buffer sizes (for fixed PGSZ). Larger
page buffers, in turn, make it possible to keep a higher
percentage of the data in main storage, hence fewer page
faults are expected to occur in those cases.

Figure 5.11 shows the storage/time graphs for
PGSZ = 4,000 characters and MSPG = 1, 2, 4 and 5. The graphs
shown in this figure have the same characteristics as the
graphs shown in Figure 5.10, namely that, for fixed PGSZ,
increasing MSPG results in an overall decrease of time costs.
The graphs of Figure 5.11 are farther apart than those shown
in Figure 5.10. This is because the page size used in the
graphs of Figure 5.11 is larger than that of Figure 5.10. The
conclusion here is that for larger page sizes, time cost
values are more sensitive to changes in MSPG.

Individual configurations are usually identical for all
eight graphs. Recall that each discontinuity point on a graph
corresponds to a breakpoint on the storage cost axis, which in
turn corresponds to a specific configuration. The graphs of
Figures 5.10 and 5.11 show that these discontinuities almost
always occur at the same abscissae.

Comparing Figures 5.10 and 5.11 one notices, also, that
each graph for PGSZ = 4,000 falls below that for PGSZ = 2,000
for the same MSPG. In other words, for a fixed number of
available pages MSPG, larger page sizes result in fewer page
faults overall, which is to be expected. This is more clearly
illustrated in Figures 5.12 through 5.15, where graphs of
equal MSPG values but different PGSZ values are drawn
together. It should be noted, however, that we are not taking
into consideration the higher transport time for larger pages.

Figure 5.11 Trade-off Graphs for Different MSPG Values
where PGSZ = 4000 Characters

Another observation is that the product (rather than the
individual values) of PGSZ and MSPG controls the relative
positioning of the storage/time graphs. This can be seen by
comparing the graphs for (MSPG = 2, PGSZ = 2,000) and
(MSPG = 1, PGSZ = 4,000), or those for (MSPG = 4,
PGSZ = 2,000) and (MSPG = 2, PGSZ = 4,000). Figure 5.16
illustrates this point. In Figure 5.16 the storage/time
graphs are given for the three combinations (MSPG = 1,
PGSZ = 2,000), (MSPG = 5, PGSZ = 2,000) and (MSPG = 5,
PGSZ = 4,000). Note that the page buffer size PBSZ is equal
to 2,000, 10,000 and 20,000 characters, respectively, for the
above three combinations.
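
The role of the product can be checked with a few lines of
Python. The sketch below takes PBSZ = MSPG x PGSZ, which agrees
with the three buffer sizes quoted above, and groups the eight
parameter combinations of Figures 5.10 and 5.11 by buffer size.
The helper name pbsz is introduced here for illustration.

```python
from collections import defaultdict
from itertools import product


def pbsz(mspg, pgsz):
    """Page buffer size in characters: pages in main storage times page size."""
    return mspg * pgsz


# The eight (MSPG, PGSZ) combinations plotted in Figures 5.10 and 5.11
combos = list(product((1, 2, 4, 5), (2_000, 4_000)))

by_buffer = defaultdict(list)
for mspg, pgsz in combos:
    by_buffer[pbsz(mspg, pgsz)].append((mspg, pgsz))

for size in sorted(by_buffer):
    print(size, by_buffer[size])
```

Combinations that land in the same group, such as (MSPG = 2,
PGSZ = 2,000) and (MSPG = 1, PGSZ = 4,000), are the pairs the
text singles out for comparison.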

In order to investigate the effect of variations in the
pointer size parameter PTRS, we hold all other parameter
values used for the storage/time trade-off graph of Figure 5.7
constant and generate the graph for PTRS values of 6, 8 and
16 characters. Note that we have shown the individual graphs
of the two cases where PTRS = 4 and PTRS = 3 in Figures 5.7
and 5.9, respectively. Figure 5.17 shows all five graphs on
the same set of coordinates.

We have already seen one of the effects of reducing the
pointer size, namely that implementations which make extensive
use of pointers become less expensive (in terms of storage
cost) when the pointer size is reduced. For example, we
observed in Section 5.3 that the reduction of PTRS from 4 to 3
characters caused the design of Figure 5.8 to become optimal
for a storage limit of 1,194,000 characters. This design, as
we discussed in Section 5.3, makes use of implementation 21,
which is a relatively costly implementation in terms of
storage space.

Figure 5.13 Trade-off Graphs for Different PGSZ Values
where MSPG = 2
(Time cost in 1000 units versus storage cost in M bytes;
curve A: MSPG = 2, PGSZ = 2000; curve B: MSPG = 2, PGSZ = 4000)

Figure 5.14 Trade-off Graphs for Different PGSZ Values
where MSPG = 4

Figure 5.15 Trade-off Graphs for Different PGSZ Values
where MSPG = 5

Figure 5.16 Trade-off Graphs for Different MSPG and PGSZ
Values

Figure 5.17 illustrates that increasing PTRS causes the
storage/time graph to move to the right. This is because
data base designs that use data base sets (thereby containing
pointers) have higher storage costs for the same
configurations. Of course, if a data base design is such that
it does not include set type definitions, then its storage
cost is not affected by changes in PTRS. This is why the
lower right sides of all five graphs in Figure 5.17 coincide.
The upper left side of each one of these graphs, however,
extends upwards along the storage cost of its respective
smallest data base design, which is also the lower bound on
the storage limit for that graph. We observe that these lower
bounds are different and that, for a fixed storage limit, an
optimal configuration may not exist if PTRS is not small
enough. For example, if PTRS = 8 characters then no optimal
configuration exists if the storage limit is set below
1,272,450 characters, while for the same storage limit an
optimal configuration exists for each of the smaller PTRS
values, namely 6, 4 and 3.

The final observation about the pointer size variation is
that the number of optimal configurations decreases as PTRS
increases. This is because, for higher PTRS values, the lower
bound on the storage limit increases while the upper bound
(the storage cost of the most costly optimal configuration)
stays the same, and this in turn excludes some configurations
(mainly those that include many set types) from the set of
optimal configurations.
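
The mechanism behind these observations can be illustrated
with a deliberately simplified model. The sketch below assumes
that the storage cost of a design is a pointer-free part plus
the number of stored pointers times PTRS. This linear form, the
three designs and all of their numbers are hypothetical and are
not taken from the storage cost formulas of Chapter 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    data_chars: int     # storage that does not depend on PTRS (hypothetical)
    pointer_count: int  # number of stored pointer occurrences (hypothetical)

    def storage_cost(self, ptrs):
        """Storage in characters for a pointer size of ptrs characters."""
        return self.data_chars + self.pointer_count * ptrs


def feasible(designs, ptrs, storage_limit):
    """Names of the designs whose storage cost fits under the limit."""
    return [d.name for d in designs if d.storage_cost(ptrs) <= storage_limit]


designs = [
    Design("many sets", 1_000_000, 40_000),
    Design("few sets", 1_150_000, 10_000),
    Design("no sets", 1_400_000, 0),
]

for ptrs in (3, 8, 16):
    smallest = min(d.storage_cost(ptrs) for d in designs)
    print(ptrs, smallest, feasible(designs, ptrs, 1_250_000))
```

In this toy model the design without sets costs the same for
every PTRS, while the smallest achievable storage cost rises
with PTRS, so a fixed storage limit admits fewer designs and
eventually none.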


CHAPTER  VI 


CONCLUSIONS  AND  FURTHER  RESEARCH 
6.1  Conclusions 

In this dissertation we have introduced a methodology
that partially automates the design of the logical structure
of a paged data base. This methodology is intended for use
by a data base designer to automatically evaluate and compare
a large number of logical structures that can model the
information content of a given set of user requirements.
This was accomplished by mapping a high level description
of user requirements into a precisely formulated dynamic
programming problem, whose solution can be easily translated
into a set of record structures, record relationships and
storage structures. Structural relationships among data
elements were constrained to conform to the specifications
of the CODASYL DBTG report [CODASYL 1971]. The performance
objective used to evaluate alternative structures was the
expected number of page fault occurrences for a given set
of operations.

A model was presented in Chapter 2 which is capable of
describing a specific data management application. The
components of this model are data items, data base relations
and storage space parameters. Data items and data base
relations were characterized by their names, sizes and
cardinalities rather than by the values of actual occurrences.
Storage space parameters include such things as the storage
space limit, memory page size, pointer size, etc. Expected
frequencies of data retrieval and update were specified
using a set of 12 data manipulation operations defined as a
part of this model.

Various implementation alternatives for a given data
base relation were described in Chapter 3. The selection of
an implementation for a relation determines whether the
component data items of the relation are associated in the
data base by physical adjacency or by some other technique
using pointers. Constraint conditions were defined in
Chapter 3 to govern the selection process so that the security
requirements of the application could be satisfied and
structures that do not conform to DBTG specifications would be
avoided. It was further observed that even after filtering
the implementation alternatives for each individual relation
using the constraint conditions, there might still be several
acceptable implementations for each relation. For any
reasonable number of relations in an application, the number
of possible configurations can be very large. In order to
evaluate and select the "best" configuration, a two-component
cost function was defined for all acceptable relation
implementations in Chapter 3. The first cost component was
defined as the storage cost and the second component was
defined as the time cost of the relation implementation.
Four categories of page faults were defined and some
approximate formulas were given for the computation of the
probability of a page fault in each category. Time costs were
given in terms of these probabilities and various other
parameters describing relations and implementations.
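
The growth in the number of configurations is easy to quantify:
since exactly one implementation is selected per relation, the
count is the product of the numbers of acceptable
implementations. In the sketch below the figure of 23 relations
is borrowed from the example application (STUDENTS is its 23rd
relation), while the assumption of three acceptable
implementations per relation is invented purely for
illustration.

```python
from math import prod


def configuration_count(acceptable):
    """acceptable maps a relation name to its number of acceptable
    implementations; one implementation is chosen per relation."""
    return prod(acceptable.values())


# Hypothetical: 23 relations with three acceptable implementations each
acceptable = {f"R{i}": 3 for i in range(1, 24)}
print(configuration_count(acceptable))  # 3**23 = 94143178827
```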

In Chapter 4 we formulated an optimization problem to
select exactly one implementation for each relation, based
on the models described in previous chapters. Further
investigation of the properties of our particular optimization
problem led to an efficient algorithm based on the notions
of "undominated choices" and "undominated solutions." Those
properties of undominated choices (and solutions) that were
utilized to improve the efficiency of the algorithm were
investigated in a formal manner. Time complexities of the
critical stages of the algorithm were also briefly analyzed.
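
A minimal sketch of such a two-phase procedure is given below.
It is not the algorithm of Chapter 4 itself. It assumes that
one choice dominates another when it is no worse in both
storage and time cost and strictly better in at least one, and
that the storage and time costs of a solution are the sums of
the costs of its choices. All names are introduced here for
illustration.

```python
from typing import NamedTuple


class Choice(NamedTuple):
    mpl: int   # implementation number
    s: float   # storage cost
    t: float   # time cost


class Solution(NamedTuple):
    mpls: tuple  # one implementation number per relation handled so far
    s: float     # total storage cost
    t: float     # total time cost


def dominates(a, b):
    """Assumed definition: a is no worse than b in both costs, better in one."""
    return a.s <= b.s and a.t <= b.t and (a.s < b.s or a.t < b.t)


def undominated(items):
    """Keep the items that no other item dominates."""
    return [x for x in items if not any(dominates(y, x) for y in items)]


def best_configuration(choices_per_relation, storage_limit):
    # Phase 1: discard dominated choices separately for each relation.
    stages = [undominated(choices) for choices in choices_per_relation]
    # Phase 2: add one relation per stage, keeping only partial solutions
    # that fit under the storage limit and are not dominated.
    partial = [Solution((), 0, 0)]
    for stage in stages:
        extended = [
            Solution(p.mpls + (c.mpl,), p.s + c.s, p.t + c.t)
            for p in partial
            for c in stage
            if p.s + c.s <= storage_limit
        ]
        partial = undominated(extended)
    # The cheapest surviving solution in time cost, or None if none fits.
    return min(partial, key=lambda p: p.t, default=None)
```

Discarding a dominated partial solution is safe under the
additivity assumption, because any completion of it can be
matched by the same completion of the solution that dominates
it.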

In  Chapter  5 we  discussed  various  data  base  designs 
that  resulted  from  applying  our  design  methodology  to  an 
example  application  described  in  Chapter  2.  Results  of  a 
sensitivity  analysis  for  time  cost  functions  with  respect  to 
certain  parameters  were  reported  in  Chapter  5. 

6.2  Further  Research 

Several areas of possible extension were suggested in
various sections of this dissertation, especially in Chapters
1 and 3. We will now summarize some of the areas where
further research may prove to be fruitful.

Some DBTG data model concepts not included in our models
can be added, with different degrees of difficulty. These
concepts were enumerated in Chapter 1 and include such
things as the modelling of REALMS, INDEXing, SINGULAR SETs,
CALC KEY, SEARCH KEY, SUB-SCHEMAs, LOCATION MODE, VIRTUAL
data items, ENCODING/DECODING and set membership classes.
Incorporation of some of these concepts requires some kind
of file organization analysis, for which the successful
results reported by Yao [1974] or Severance [1972] may be
utilized.

As pointed out in Chapter 3, the set of implementation
alternatives defined in that chapter is not considered to be
exhaustive of all conceivable ways of implementing a data
base relation. Some possible extensions to the implementation
alternatives were suggested and briefly discussed at the end
of Chapter 3.

The application of the optimization algorithm (of
Chapter 4) to the example problem (of Chapter 2) showed that a
significant saving could be achieved in the processing time of
the second phase of the algorithm by discarding "dominated
choices" in the first phase. However, it is theoretically
possible that the second phase of the algorithm could be
faced with many stages for each of which several undominated
choices exist. If at the same time the cost components are
such that the second phase of the algorithm exhibits its
worst-case behavior, then the actual number of steps in its
execution may become very large. In case such extreme
conditions prove to be imminent, some modifications in the
definitions of dominance between pairs of choices or solutions
may be made to improve the efficiency of the algorithm.
Recall from Chapter 4 that q = <MPL, S, T> is a choice for a
relation R if MPL is an acceptable implementation for R and S
and T are the storage and time costs, respectively, of MPL for
R. By our definition of dominance, if q1 = <9, 100, 100> and
q2 = <17, 99, 10,000> are two choices for a relation R, then
{q1, q2} is an undominated set of choices for R, although the
time cost of q2 is much higher than that of q1 in return for a
very small advantage of q2 over q1 in storage cost. If some
kind of relative importance of storage and time costs were
assumed, then the notion of dominance could be generalized to,
say, functional dominance, where a set of functions (maybe
two) would be utilized to decide whether or not two choices
(or solutions) were undominated. Properties of this kind of
dominance must also be investigated and employed in the
corresponding optimization algorithm(s).
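
One possible reading of functional dominance, offered here only
as an assumption, is to relax each cost of the dominated choice
through a function before comparing. In the sketch below f_s
and f_t are two such hypothetical functions; the 2% storage
tolerance is invented for illustration, and the plain dominance
test is the assumed "no worse in both costs, better in one"
rule.

```python
from typing import NamedTuple


class Choice(NamedTuple):
    mpl: int   # implementation number
    s: float   # storage cost
    t: float   # time cost


def dominates(a, b):
    """Plain dominance (assumed): no worse in both costs, better in one."""
    return a.s <= b.s and a.t <= b.t and (a.s < b.s or a.t < b.t)


def functionally_dominates(a, b, f_s, f_t):
    """Hypothetical generalization: a dominates b if a's costs do not
    exceed b's costs after b's costs are relaxed by f_s and f_t."""
    return a.s <= f_s(b.s) and a.t <= f_t(b.t)


q1 = Choice(9, 100, 100)
q2 = Choice(17, 99, 10_000)


def relaxed_storage(s):
    return 1.02 * s  # tolerate up to 2% more storage (invented figure)


def same_time(t):
    return t         # no relaxation on time cost


print(dominates(q1, q2), dominates(q2, q1))
print(functionally_dominates(q1, q2, relaxed_storage, same_time))
print(functionally_dominates(q2, q1, relaxed_storage, same_time))
```

Under the plain rule neither choice dominates the other, while
under the relaxed rule q1 dominates q2. Note that a relaxed
rule of this form can let two nearly identical choices dominate
each other, which is one of the properties that would have to
be examined before it is used in the algorithm.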

The effect of concurrency on data access interference
has not been treated in this research. Concurrent access to
data bases by multiple users results in contention problems
studied, for example, by Collmeyer [1971] and Shemer [1972].
The overhead required to avoid or resolve such contention
problems may considerably affect our cost computations.
Further research is necessary before such concepts can be
incorporated into our models.

An important class of data base processing operations
is sequential (batched) access. This class of operations
often arises in data base processing for report generation,
sorting, merging, etc. Our data manipulation operations
can be extended so that sequential operations may be
specified. Time cost functions for such operations must be
developed and their effects on other parts of the overall
model must be analyzed so that the results may be
incorporated into our data base design methodology.

Extensions to our models by the incorporation of more
DBTG features were enumerated earlier in this section. We
will now briefly discuss the degrees of difficulty in adding
some of those features.

In order to accommodate the SINGULAR SET concept, it
should be noted that for each data item in a record one such
set may be defined. Once the set implementation technique
for these sets is fixed, one can add their storage costs
to the duplication and aggregation implementations. Time
cost functions will be affected through the Pi probabilities.
Inclusion of this feature is reasonably easy except possibly
in the cases where the data items involved are parts of
vectors or repeating groups.

INDEXING and SEARCH KEY concepts are basically not
difficult to include. It should, however, be noted that
since each set may be given any one of a number of possible
indexing techniques, or none at all, the total size of the
problem may grow very fast.




Extensions concerning the representation of data items,
such as the VIRTUAL and ENCODING/DECODING concepts, were
discussed at the end of Chapter 3 (page 169).

Inclusion of the concept of LOCATION MODE (of records)
would require some changes in our overall model. Basically,
it requires the user requirements model to be changed,
particularly in the definitions of operations, if all location
modes are to be accounted for. It is clear that any
changes in those operations will propagate to the time cost
formulations of Chapter 3.


REFERENCES 


[Bachman 1974] Bachman, C.W., "Implementation Techniques
for Data Structure Sets", Data Base Management Systems,
edited by D.A. Jardine, Proc. of the SHARE Working
Conference on DBMS's, Montreal, Canada, July 23-27,
1973, North-Holland, American Elsevier, 1974.

[Beck 1976] Beck, Leland L., "An Approach to the Creation
of Structured Data Processing Systems", Proceedings,
1976 SIGMOD, Page 179, International Conference on
Management of Data, Washington, D.C., June 2-4, Edited
by J.B. Rothnie.

[Bellman 1957] Bellman, R.E., "Dynamic Programming",
Princeton University Press, Princeton, New Jersey,
1957, (A Rand Corporation Research Study).

[Bellman 1962] Bellman, R.E. and Dreyfus, S.E., "Applied
Dynamic Programming", Princeton University Press.

[Berg 1976] Berg, John L. (ed.), "Data Base Directions:
The Next Steps", DATA BASE, Vol. 3, No. 2, Fall 1976
(A quarterly Newsletter of SIGBDP); SIGMOD RECORD,
Vol. 8, No. 4, Nov. 1976, Proceedings of the Workshop
of the NBS and ACM held at Fort Lauderdale, Fla.,
Oct. 29-31, 1975.

[Bubenko 1976] Bubenko, J.A., "From Information Structures
to DBTG-Data Structures", Proceedings of Conference on
Data: Abstraction, Definition and Structure, March
1976, Salt Lake City, Utah, F.D.T., Bulletin of
ACM-SIGMOD, V. 8, N. 2, 1976, pp. 73-85.

[CODASYL 1971] CODASYL, "Data Base Task Group", Report,
CODASYL, April 1971.

[CODASYL 1973a] CODASYL, "CODASYL COBOL Data Base Facility
Proposal", CODASYL, Data Base Language Task Group,
March 1973.

[CODASYL 1973b] CODASYL, "Journal of Development", CODASYL,
Data Description Language Committee, June 1973.

[Collmeyer 1971] Collmeyer, A.J., "Database Management in
a Multi-Access Environment", Computer, Vol. 4, No. 6,
Nov./Dec. 1971, pp. 36-46.

[Cullinane] Cullinane, "Integrated Data Base Management
System, DDL and DML", Cullinane Corporation.

[Date 1975] Date, C.J., "An Introduction to Database
Systems", The Systems Programming Series,
Addison-Wesley, 1975.




[DEC] DEC, "DBMS Programmer's Procedures Manual", Digital
Equipment Corporation, Document DEC-10-APPMA-A-D.

[Gerritsen 1975] Gerritsen, Rob, "A Preliminary System for
the Design of DBTG Data Structures", CACM, October
1975, Vol. 18, Number 10, pp. 551-557.

[GUIDE-SHARE 1970] GUIDE-SHARE, "Data Base Management
System Requirements", Data Base Requirements Group,
Guide-Share, Nov. 1970.

[Honeywell] Honeywell, "Honeywell Series 600/6000
Integrated Data Store", Reference Manual, Document BR69.

[Knuth 1968] Knuth, D.E., "The Art of Computer Programming",
Vol. 1: "Fundamental Algorithms", Addison-Wesley,
Reading, Mass., 1968.

[Knuth 1973] Knuth, D.E., "The Art of Computer Programming",
Volume 3: "Sorting and Searching", Addison-Wesley,
Reading, Mass., 1973.

[Mitoma 1975] Mitoma, M.F., "Optimal Data Base Schema
Design", Ph.D. Thesis, University of Michigan.

[Severance 1972] Severance, D.G., "Some Generalized
Modeling Structures for Use in Design of File
Organizations", Ph.D. Thesis, University of Michigan, 1972.

[Shemer 1972] Shemer, J.E. and Collmeyer, A.J., "Database
Sharing: A Study of Interference, Roadblock and
Deadlock", Proceedings of 1972 ACM-SIGFIDET Workshop,
Data Description, Access and Control, Edited by
A.L. Dean, Denver, Colorado, Nov. 29-30, Dec. 1, 1972,
pp. 147-162.

[Taylor 1976] Taylor, R.W. and Frank, R.L., "CODASYL
Data-Base Management Systems", ACM Computing Surveys,
Volume 8, Number 1, March 1976, pp. 67-103.

[Tuel 1976] Tuel, W.G., Jr., "An Analysis of Buffer Paging
in Virtual Storage Systems", IBM J. Res. Develop.,
Vol. 20, No. 5, September 1976, p. 518.

[UNIVAC 1974] UNIVAC, "Data Manipulation Language" UP-7992
and "Schema Definition" UP 7907, SPERRY UNIVAC 1100
Series Data Management System (DMS 1100).

[Yao 1974] Yao, Shi Bing, "Evaluation and Optimization of
File Organizations through Analytic Modelling", Ph.D.
Thesis, University of Michigan, 1974.


MISSION
of
Rome Air Development Center

RADC plans and conducts research, exploratory and advanced
development programs in command, control, and communications
(C3) activities, and in the C3 areas of information sciences
and intelligence. The principal technical mission areas
are communications, electromagnetic guidance and control,
surveillance of ground and aerospace objects, intelligence
data collection and handling, information system technology,
ionospheric propagation, solid state sciences, microwave
physics and electronic reliability, maintainability and
compatibility.


